Getting started with your Lab Sandbox

You can find important course-specific tips and notes for your Lab Sandbox in this
quick guide to use throughout your course. You’ll be able to reference this at any time
or visit the Learner Help Center for more info.

Throughout this course, you'll encounter datasets which are hosted on other websites or are linked from the course instructional materials. If you'd like to complete your work in the lab sandbox environment, please download these datasets from their listed websites and upload the data files directly into your RStudio lab environment. Lab Sandboxes have limited access to external sites, so uploading your data files directly will help ensure you do not encounter any access errors.

What tools are already installed in my sandbox environment?

  • Python==3.7.6
  • turicreate==6.4.1
  • scikit-learn(same as sklearn)==0.22.2.post1
  • pandas==1.0.3
  • numpy==1.18.4
  • matplotlib==3.2.1
  • unzip for uploading/unpacking data and files via Jupyter's built-in Terminal

Additional Course Specific Notes:

  • IMPORTANT: To optimize your notebook performance, please shut down each of your
    open/running Jupyter Notebooks after you complete them within your Lab
    .
    This will provide full and dedicated system resources to the current lab assignment
    you are working on in your Lab Sandbox. If you experience any unexpected slow
    processing in your notebook, try to restart your Lab kernel and only open that single
    notebook to complete your work. You can learn more about restarting your Lab Sandbox
    and troubleshooting your lab in our Learner Help Center article here.

  • Ensure you are using Python3 to complete the assessments. In Python3, the print statement syntax
    is print("Hello") rather than print "Hello" as stated in the solution files

  • Upload the unzipped files to the Labs Sandbox.

  • It is highly advised to practice with turicreate if you use this sandbox as it can be used on large datasets.
    Turicreate and python3 will be the preferred/supported route for the sandbox,
    to enable a newer version of python and avoid compatibility issues between versions.
    If you prefer to have Python2 installed in your local system, you can use Sframe or graphlab locally as well.
    You are still able to complete all assignments locally if you would prefer using other tools.

  • As an important note, the sandbox environment has limited access to the internet, so you will not be
    able to install additional packages into this sandbox. However, the required dependencies for the Python3,
    Jupyter Notebook, and turicreate path are pre-installed. This should still allow you to complete your coursework
    in the sandbox if you are unable to complete it locally. If you find that you are blocked using available tools in this
    sandbox at any point, please reach out to Coursera through the Learner Help Center with your feedback!

  • Decide on which tool you'd like to use at the very beginning of the course. It is highly advisable
    not to switch between turicreate and other Python libraries such as sklearn or Pandas.

  • If anything is unclear regarding the course content, please go through the
    Discussion board and see if your question has been answered. If you have any issues
    with the Jupyter enviornment itself, please contact Coursera through the Learner Help Center.

Important Note:


  • To use SFrame, kindly import the package turicreate, as the prior standalone SFrame package is not compatible with Python 3.X.

  • Run the following code to get started: from turicreate import sframe

  • Happy Learning!

    In [1]:
    ! pip install turicreate==6.4.1
    
    Requirement already satisfied: turicreate==6.4.1 in /opt/conda/lib/python3.7/site-packages (6.4.1)
    Requirement already satisfied: coremltools==3.3 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (3.3)
    Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (1.18.4)
    Requirement already satisfied: decorator>=4.0.9 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (4.4.2)
    Requirement already satisfied: pandas>=0.23.2 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (1.0.3)
    Requirement already satisfied: six>=1.10.0 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (1.14.0)
    Requirement already satisfied: numba<0.51.0 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (0.48.0)
    Requirement already satisfied: pillow>=5.2.0 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (7.1.2)
    Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (1.4.1)
    Requirement already satisfied: tensorflow<2.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (2.0.4)
    Requirement already satisfied: prettytable==0.7.2 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (0.7.2)
    Requirement already satisfied: resampy==0.2.1 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (0.2.1)
    Requirement already satisfied: requests>=2.9.1 in /opt/conda/lib/python3.7/site-packages (from turicreate==6.4.1) (2.23.0)
    Requirement already satisfied: protobuf>=3.1.0 in /opt/conda/lib/python3.7/site-packages (from coremltools==3.3->turicreate==6.4.1) (3.11.4)
    Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from numba<0.51.0->turicreate==6.4.1) (46.1.3.post20200325)
    Requirement already satisfied: llvmlite<0.32.0,>=0.31.0dev0 in /opt/conda/lib/python3.7/site-packages (from numba<0.51.0->turicreate==6.4.1) (0.31.0)
    Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.2->turicreate==6.4.1) (2.8.1)
    Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.2->turicreate==6.4.1) (2020.1)
    Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.9.1->turicreate==6.4.1) (3.0.4)
    Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.9.1->turicreate==6.4.1) (2.9)
    Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.9.1->turicreate==6.4.1) (1.25.9)
    Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.9.1->turicreate==6.4.1) (2020.4.5.1)
    Requirement already satisfied: keras-applications>=1.0.8 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.0.8)
    Requirement already satisfied: opt-einsum>=2.3.2 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (3.3.0)
    Requirement already satisfied: tensorflow-estimator<2.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (2.0.1)
    Requirement already satisfied: wrapt>=1.11.1 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.12.1)
    Requirement already satisfied: google-pasta>=0.1.6 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.2.0)
    Requirement already satisfied: termcolor>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.1.0)
    Requirement already satisfied: keras-preprocessing>=1.0.5 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.1.2)
    Requirement already satisfied: h5py<=2.10.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (2.10.0)
    Requirement already satisfied: wheel>=0.26 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.34.2)
    Requirement already satisfied: tensorboard<2.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (2.0.2)
    Requirement already satisfied: absl-py>=0.7.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.13.0)
    Requirement already satisfied: gast==0.2.2 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.2.2)
    Requirement already satisfied: grpcio>=1.8.6 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.40.0)
    Requirement already satisfied: astor>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.8.1)
    Requirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (3.3.4)
    Requirement already satisfied: google-auth<2,>=1.6.3 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.35.0)
    Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.4.6)
    Requirement already satisfied: werkzeug>=0.11.15 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (2.0.1)
    Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.2.8)
    Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (4.2.2)
    Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (4.7.2)
    Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.3.0)
    Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from markdown>=2.6.8->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (1.6.0)
    Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (0.4.8)
    Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (3.0.1)
    Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->markdown>=2.6.8->tensorboard<2.1.0,>=2.0.0->tensorflow<2.1.0,>=2.0.0->turicreate==6.4.1) (3.1.0)
    WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
    You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.
    
    In [ ]:
    !unzip people_wiki.sframe.zip
    
    Archive:  people_wiki.sframe.zip
    replace people_wiki.sframe/m_cf05efad0f89a530.frame_idx? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
    In [3]:
    from __future__ import print_function # to conform python 2.x print to python 3.x
    import turicreate
    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline
    
    In [7]:
    wiki = turicreate.SFrame('people_wiki.sframe')
    
    In [8]:
    wiki
    
    Out[8]:
    URI name text
    <http://dbpedia.org/resou
    rce/Digby_Morrell> ...
    Digby Morrell digby morrell born 10
    october 1979 is a former ...
    <http://dbpedia.org/resou
    rce/Alfred_J._Lewy> ...
    Alfred J. Lewy alfred j lewy aka sandy
    lewy graduated from ...
    <http://dbpedia.org/resou
    rce/Harpdog_Brown> ...
    Harpdog Brown harpdog brown is a singer
    and harmonica player who ...
    <http://dbpedia.org/resou
    rce/Franz_Rottensteiner> ...
    Franz Rottensteiner franz rottensteiner born
    in waidmannsfeld lower ...
    <http://dbpedia.org/resou
    rce/G-Enka> ...
    G-Enka henry krvits born 30
    december 1974 in tallinn ...
    <http://dbpedia.org/resou
    rce/Sam_Henderson> ...
    Sam Henderson sam henderson born
    october 18 1969 is an ...
    <http://dbpedia.org/resou
    rce/Aaron_LaCrate> ...
    Aaron LaCrate aaron lacrate is an
    american music producer ...
    <http://dbpedia.org/resou
    rce/Trevor_Ferguson> ...
    Trevor Ferguson trevor ferguson aka john
    farrow born 11 november ...
    <http://dbpedia.org/resou
    rce/Grant_Nelson> ...
    Grant Nelson grant nelson born 27
    april 1971 in london ...
    <http://dbpedia.org/resou
    rce/Cathy_Caruth> ...
    Cathy Caruth cathy caruth born 1955 is
    frank h t rhodes ...
    [59071 rows x 3 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [89]:
    wiki['word_count'] = turicreate.text_analytics.count_words(wiki['text'])
    
    In [10]:
    model = turicreate.nearest_neighbors.create(wiki, label='name', features=['word_count'],
                                                method='brute_force', distance='euclidean')
    
    Starting brute force nearest neighbors model training.
    Validating distance components.
    Initializing model data.
    Initializing distances.
    Done.
    In [11]:
    model.query(wiki[wiki['name']=='Barack Obama'], label='name', k=10)
    
    Starting pairwise querying.
    +--------------+---------+-------------+--------------+
    | Query points | # Pairs | % Complete. | Elapsed Time |
    +--------------+---------+-------------+--------------+
    | 0            | 1       | 0.00169288  | 3.948ms      |
    | Done         |         | 100         | 407.18ms     |
    +--------------+---------+-------------+--------------+
    Out[11]:
    query_label reference_label distance rank
    Barack Obama Barack Obama 0.0 1
    Barack Obama Joe Biden 33.075670817082454 2
    Barack Obama George W. Bush 34.39476704383968 3
    Barack Obama Lawrence Summers 36.15245496505044 4
    Barack Obama Mitt Romney 36.16628264005025 5
    Barack Obama Francisco Barrio 36.3318042491699 6
    Barack Obama Walter Mondale 36.40054944640259 7
    Barack Obama Wynn Normington Hugh-
    Jones ...
    36.49657518178932 8
    Barack Obama Don Bonker 36.6333181680284 9
    Barack Obama Andy Anstett 36.959437225152655 10
    [10 rows x 4 columns]
    In [12]:
    def top_words(name):
        """
        Get a table of the most frequent words in the given person's wikipedia page.
        """
        row = wiki[wiki['name'] == name]
        word_count_table = row[['word_count']].stack('word_count', new_column_name=['word','count'])
        return word_count_table.sort('count', ascending=False)
    
    In [13]:
    obama_words = top_words('Barack Obama')
    obama_words
    
    Out[13]:
    word count
    the 40.0
    in 30.0
    and 21.0
    of 18.0
    to 14.0
    his 11.0
    obama 9.0
    act 8.0
    a 7.0
    he 7.0
    [273 rows x 2 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [14]:
    barrio_words = top_words('Francisco Barrio')
    barrio_words
    
    Out[14]:
    word count
    the 36.0
    of 24.0
    and 18.0
    in 17.0
    he 10.0
    to 9.0
    chihuahua 7.0
    governor 6.0
    a 6.0
    as 5.0
    [225 rows x 2 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [15]:
    combined_words = obama_words.join(barrio_words, on='word')
    combined_words
    
    Out[15]:
    word count count.1
    the 40.0 36.0
    in 30.0 17.0
    and 21.0 18.0
    of 18.0 24.0
    to 14.0 9.0
    his 11.0 5.0
    a 7.0 6.0
    he 7.0 10.0
    as 6.0 5.0
    was 5.0 4.0
    [56 rows x 3 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [16]:
    combined_words = combined_words.rename({'count':'Obama', 'count.1':'Barrio'})
    combined_words
    
    Out[16]:
    word Obama Barrio
    the 40.0 36.0
    in 30.0 17.0
    and 21.0 18.0
    of 18.0 24.0
    to 14.0 9.0
    his 11.0 5.0
    a 7.0 6.0
    he 7.0 10.0
    as 6.0 5.0
    was 5.0 4.0
    [56 rows x 3 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [19]:
    combined_words.sort('Obama', ascending=False)[:5]
    
    Out[19]:
    word Obama Barrio
    the 40.0 36.0
    in 30.0 17.0
    and 21.0 18.0
    of 18.0 24.0
    to 14.0 9.0
    [5 rows x 3 columns]
    In [24]:
    common_words = set(combined_words.sort('Obama', ascending=False)[:5]['word']) # YOUR CODE HERE
    
    def has_top_words(word_count_vector):
        # extract the keys of word_count_vector and convert it to a set
        unique_words = set(word_count_vector.keys())   # YOUR CODE HERE
        # return True if common_words is a subset of unique_words
        # return False otherwise
        return common_words.issubset(unique_words)  # YOUR CODE HERE
    
    wiki['has_top_words'] = wiki['word_count'].apply(has_top_words)
    
    # use has_top_words column to answer the quiz question
    ... # YOUR CODE HERE
    wiki['has_top_words'].sum()
    
    Using default 16 lambda workers.
    To maximize the degree of parallelism, add the following code to the beginning of the program:
    "turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 32)"
    Note that increasing the degree of parallelism also increases the memory footprint.
    Out[24]:
    56066
    In [25]:
    len(wiki['has_top_words'])
    
    Out[25]:
    59071
    In [26]:
    print('Output from your function:', has_top_words(wiki[32]['word_count']))
    print('Correct output: True')
    print('Also check the length of unique_words. It should be 167')
    print(len(wiki[32]['word_count']))
    
    Output from your function: True
    Correct output: True
    Also check the length of unique_words. It should be 167
    167
    
    In [27]:
    print('Output from your function:', has_top_words(wiki[33]['word_count']))
    print('Correct output: False')
    print('Also check the length of unique_words. It should be 188')
    print(len(wiki[33]['word_count']))
    
    Output from your function: False
    Correct output: False
    Also check the length of unique_words. It should be 188
    188
    
    In [31]:
    o = wiki[wiki['name'] == 'Barack Obama']['word_count'][0]
    b = wiki[wiki['name'] == 'George W. Bush']['word_count'][0]
    j = wiki[wiki['name'] == 'Joe Biden']['word_count'][0]
    turicreate.toolkits.distances.euclidean(o, b)
    
    Out[31]:
    34.39476704383968
    In [32]:
    turicreate.toolkits.distances.euclidean(o, j)
    
    Out[32]:
    33.075670817082454
    In [33]:
    turicreate.toolkits.distances.euclidean(j, b)
    
    Out[33]:
    32.7566787083184
    In [34]:
    bush_words = top_words('George W. Bush')
    combined_words = obama_words.join(bush_words, on='word')
    combined_words = combined_words.rename({'count':'Obama', 'count.1':'Bush'})
    combined_words.sort('Obama', ascending=False)
    
    Out[34]:
    word Obama Bush
    the 40.0 39.0
    in 30.0 22.0
    and 21.0 14.0
    of 18.0 14.0
    to 14.0 11.0
    his 11.0 6.0
    act 8.0 3.0
    a 7.0 6.0
    he 7.0 8.0
    as 6.0 6.0
    [86 rows x 3 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [90]:
    wiki['tf_idf'] = turicreate.text_analytics.tf_idf(wiki['word_count'])
    
    In [91]:
    model_tf_idf = turicreate.nearest_neighbors.create(wiki, label='name', features=['tf_idf'],
                                                       method='brute_force', distance='euclidean')
    
    Starting brute force nearest neighbors model training.
    Validating distance components.
    Initializing model data.
    Initializing distances.
    Done.
    In [92]:
    model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=10)
    
    Starting pairwise querying.
    +--------------+---------+-------------+--------------+
    | Query points | # Pairs | % Complete. | Elapsed Time |
    +--------------+---------+-------------+--------------+
    | 0            | 1       | 0.00169288  | 13.071ms     |
    | Done         |         | 100         | 467.194ms    |
    +--------------+---------+-------------+--------------+
    Out[92]:
    query_label reference_label distance rank
    Barack Obama Barack Obama 0.0 1
    Barack Obama Phil Schiliro 106.86101369140928 2
    Barack Obama Jeff Sessions 108.87167421571078 3
    Barack Obama Jesse Lee (politician) 109.04569790902957 4
    Barack Obama Samantha Power 109.10810616502708 5
    Barack Obama Bob Menendez 109.78186710530215 6
    Barack Obama Eric Stern (politician) 109.9577880796839 7
    Barack Obama James A. Guest 110.4138887175989 8
    Barack Obama Roland Grossenbacher 110.47060870018984 9
    Barack Obama Tulsi Gabbard 110.6969979988001 10
    [10 rows x 4 columns]
    In [93]:
    def top_words_tf_idf(name):
        row = wiki[wiki['name'] == name]
        word_count_table = row[['tf_idf']].stack('tf_idf', new_column_name=['word','weight'])
        return word_count_table.sort('weight', ascending=False)
    
    In [94]:
    obama_tf_idf = top_words_tf_idf('Barack Obama')
    obama_tf_idf
    
    Out[94]:
    word weight
    obama 43.2956530720749
    act 27.67822262297991
    iraq 17.747378587965535
    control 14.887060845181308
    law 14.722935761763422
    ordered 14.533373950913514
    military 13.115932778499415
    involvement 12.784385241175055
    response 12.784385241175055
    democratic 12.410688697332166
    [273 rows x 2 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [95]:
    schiliro_tf_idf = top_words_tf_idf('Phil Schiliro')
    schiliro_tf_idf
    
    Out[95]:
    word weight
    schiliro 21.972990778450388
    staff 15.856441635180534
    congressional 13.547087656327776
    daschleschiliro 10.986495389225194
    obama 9.621256238238866
    waxman 9.04058524016988
    president 9.033586614158258
    2014from 8.683910296231149
    law 7.361467880881711
    consultant 6.913104037247212
    [119 rows x 2 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [97]:
    combined_words = obama_tf_idf.join(schiliro_tf_idf, on='word')
    combined_words = combined_words.rename({'weight':'Obama', 'weight.1':'Schiliro'})
    combined_words.sort('Obama', ascending=False)
    
    Out[97]:
    word Obama Schiliro
    obama 43.2956530720749 9.621256238238866
    law 14.722935761763422 7.361467880881711
    democratic 12.410688697332166 6.205344348666083
    senate 10.164288179703693 3.3880960599012306
    presidential 7.386955418904825 3.6934777094524125
    president 7.226869291326606 9.033586614158258
    policy 6.095386282141427 3.0476931410707135
    states 5.473200989631017 1.824400329877006
    office 5.2481728232196465 2.6240864116098233
    2011 5.107041270312876 3.4046941802085837
    [47 rows x 3 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [98]:
    common_words = set(combined_words.sort('Obama', ascending=False)[:5]['word']) # YOUR CODE HERE
    
    def has_top_words(word_count_vector):
        # extract the keys of word_count_vector and convert it to a set
        unique_words = set(word_count_vector.keys())   # YOUR CODE HERE
        # return True if common_words is a subset of unique_words
        # return False otherwise
        return common_words.issubset(unique_words)  # YOUR CODE HERE
    
    wiki['has_top_words'] = wiki['word_count'].apply(has_top_words)
    
    # use has_top_words column to answer the quiz question
    ... # YOUR CODE HERE
    wiki['has_top_words'].sum()
    
    # use has_top_words column to answer the quiz question
    # ...  # YOUR CODE HERE
    
    Out[98]:
    14
    In [99]:
    common_words
    
    Out[99]:
    {'democratic', 'law', 'obama', 'presidential', 'senate'}
    In [100]:
    biden_tf_idf = top_words_tf_idf('Joe Biden')
    
    In [101]:
    biden_tf_idf
    
    Out[101]:
    word weight
    biden 63.92610492536963
    obama 19.24251247647773
    act 17.298889139362444
    vice 15.355736099810581
    resolved 13.135309562857193
    senator 11.716882477603237
    delaware 11.396456717061318
    judiciary 11.011712931766406
    dosf 10.986495389225194
    thomasbiden 10.986495389225194
    [219 rows x 2 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [102]:
    combined_words = obama_tf_idf.join(biden_tf_idf, on='word')
    combined_words['weight'], combined_words['weight.1']
    
    Out[102]:
    (dtype: float
     Rows: 84
     [43.2956530720749, 27.67822262297991, 17.747378587965535, 14.887060845181308, 14.722935761763422, 13.115932778499415, 12.410688697332166, 11.591942692842837, 10.164288179703693, 9.43101391473379, 9.319341564760851, 8.907053847545358, 8.842460838379667, 7.712676160711769, 7.654290879049991, 7.431147327735781, 7.386955418904825, 7.226869291326606, 7.0620334604226676, 6.642689967371511, 6.642689967371511, 6.3816977057812005, 6.095386282141427, 5.656236009557883, 5.6158573610975315, 5.592867842872833, 5.473200989631017, 5.107041270312876, 5.067601534952048, 4.88376320446593, 4.831637295208776, 4.703766236011668, 4.693309450812809, 4.594578275832593, 4.545548848592274, 4.523465932304524, 4.327201469541557, 4.176352939110058, 4.03568062078261, 4.015765311081669, 3.9453132752336004, 3.9140734886878232, 3.7697859025157365, 3.7451291059028766, 3.723866788250953, 3.68265216394749, 3.6593720969659014, 3.446936559924164, 3.445873860568042, 3.3821333532750204, 3.344851662973069, 3.3244978303233013, 3.185667920243947, 3.0725446998610506, 2.896399606044235, 2.8887260073502303, 2.8308461188591933, 2.809822617276739, 2.634930295807099, 2.0868146141979307, 2.0079609791418744, 1.7938099524877322, 1.713990158976156, 1.5093391374786154, 1.4967823726683713, 1.493579903611068, 1.0752380994247055, 0.8871532656125274, 0.8812660139569034, 0.7630171320744707, 0.6614069466714981, 0.6572291275451891, 0.6074059275661821, 0.53639254752953, 0.43063857330825733, 0.3968289280609173, 0.36882550670120073, 0.29145011737314763, 0.07481117158400744, 0.05523250095103998, 0.039334291308082026, 0.028962190503643476, 0.01564802185902329, 0.004063113702956533],
     dtype: float
     Rows: 84
     [19.24251247647773, 17.298889139362444, 4.436844646991384, 7.443530422590654, 2.4538226269605703, 3.2789831946248538, 9.308016522999125, 9.659952244035697, 10.164288179703693, 4.715506957366895, 3.1064471882536173, 8.907053847545358, 8.842460838379667, 5.14178410714118, 7.654290879049991, 7.431147327735781, 7.386955418904825, 9.033586614158258, 4.708022306948445, 6.642689967371511, 6.642689967371511, 3.1908488528906003, 3.0476931410707135, 3.770824006371922, 5.6158573610975315, 5.592867842872833, 5.473200989631017, 1.7023470901042919, 10.135203069904096, 2.441881602232965, 4.831637295208776, 4.703766236011668, 1.5644364836042695, 4.594578275832593, 4.545548848592274, 4.523465932304524, 4.327201469541557, 4.176352939110058, 8.07136124156522, 4.015765311081669, 1.9726566376168002, 3.9140734886878232, 3.7697859025157365, 3.7451291059028766, 3.723866788250953, 7.36530432789498, 3.6593720969659014, 3.446936559924164, 3.445873860568042, 1.1273777844250068, 1.6724258314865346, 3.3244978303233013, 1.5928339601219734, 1.5362723499305253, 2.896399606044235, 1.3130572760682866, 2.8308461188591933, 8.429467851830218, 2.634930295807099, 2.0868146141979307, 2.0079609791418744, 5.381429857463196, 0.856995079488078, 3.0186782749572307, 1.4967823726683713, 1.493579903611068, 2.150476198849411, 0.8871532656125274, 0.8812660139569034, 0.2543390440248236, 3.3070347333574905, 0.5163943144997915, 0.6074059275661821, 0.53639254752953, 0.43063857330825733, 0.6349262848974677, 0.18441275335060037, 0.21858758802986072, 0.049874114389338295, 0.08284875142655997, 0.03371510683549888, 0.015446501601943188, 0.014157734062925836, 0.0033520688049391402])
    In [104]:
    combined_words
    
    Out[104]:
    word weight weight.1
    obama 43.2956530720749 19.24251247647773
    act 27.67822262297991 17.298889139362444
    iraq 17.747378587965535 4.436844646991384
    control 14.887060845181308 7.443530422590654
    law 14.722935761763422 2.4538226269605703
    military 13.115932778499415 3.2789831946248538
    democratic 12.410688697332166 9.308016522999125
    us 11.591942692842837 9.659952244035697
    senate 10.164288179703693 10.164288179703693
    nominee 9.43101391473379 4.715506957366895
    [84 rows x 3 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [111]:
    turicreate.toolkits.distances.euclidean(list(combined_words['weight']), list(combined_words['weight.1']))
    
    Out[111]:
    37.90533065920024
    In [109]:
    turicreate.toolkits.distances.euclidean(wiki[wiki['name'] == 'Barack Obama']['tf_idf'][0], 
                                            wiki[wiki['name'] == 'Joe Biden']['tf_idf'][0])
    
    Out[109]:
    123.29745600964294
    In [44]:
    model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=10)
    
    Starting pairwise querying.
    +--------------+---------+-------------+--------------+
    | Query points | # Pairs | % Complete. | Elapsed Time |
    +--------------+---------+-------------+--------------+
    | 0            | 1       | 0.00169288  | 5.258ms      |
    | Done         |         | 100         | 498.118ms    |
    +--------------+---------+-------------+--------------+
    Out[44]:
    query_label reference_label distance rank
    Barack Obama Barack Obama 0.0 1
    Barack Obama Phil Schiliro 106.86101369140928 2
    Barack Obama Jeff Sessions 108.87167421571078 3
    Barack Obama Jesse Lee (politician) 109.04569790902957 4
    Barack Obama Samantha Power 109.10810616502708 5
    Barack Obama Bob Menendez 109.78186710530215 6
    Barack Obama Eric Stern (politician) 109.9577880796839 7
    Barack Obama James A. Guest 110.4138887175989 8
    Barack Obama Roland Grossenbacher 110.47060870018984 9
    Barack Obama Tulsi Gabbard 110.6969979988001 10
    [10 rows x 4 columns]
    In [45]:
    def compute_length(row):
        return len(row['text'].split(' '))
    
    wiki['length'] = wiki.apply(compute_length) 
    
    In [46]:
    nearest_neighbors_euclidean = model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=100)
    nearest_neighbors_euclidean = nearest_neighbors_euclidean.join(wiki[['name', 'length']], on={'reference_label':'name'})
    
    Starting pairwise querying.
    +--------------+---------+-------------+--------------+
    | Query points | # Pairs | % Complete. | Elapsed Time |
    +--------------+---------+-------------+--------------+
    | 0            | 1       | 0.00169288  | 7.05ms       |
    | Done         |         | 100         | 574.183ms    |
    +--------------+---------+-------------+--------------+
    In [47]:
    nearest_neighbors_euclidean.sort('rank')
    
    Out[47]:
    query_label reference_label distance rank length
    Barack Obama Barack Obama 0.0 1 540
    Barack Obama Phil Schiliro 106.86101369140928 2 208
    Barack Obama Jeff Sessions 108.87167421571078 3 230
    Barack Obama Jesse Lee (politician) 109.04569790902957 4 216
    Barack Obama Samantha Power 109.10810616502708 5 310
    Barack Obama Bob Menendez 109.78186710530215 6 220
    Barack Obama Eric Stern (politician) 109.9577880796839 7 255
    Barack Obama James A. Guest 110.4138887175989 8 215
    Barack Obama Roland Grossenbacher 110.47060870018984 9 201
    Barack Obama Tulsi Gabbard 110.6969979988001 10 228
    [100 rows x 5 columns]
    Note: Only the head of the SFrame is printed.
    You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
    In [48]:
    plt.figure(figsize=(10.5,4.5))
    plt.hist(wiki['length'], 50, color='k', edgecolor='None', histtype='stepfilled', normed=True,
             label='Entire Wikipedia', zorder=3, alpha=0.8)
    plt.hist(nearest_neighbors_euclidean['length'], 50, color='r', edgecolor='None', histtype='stepfilled', normed=True,
             label='100 NNs of Obama (Euclidean)', zorder=10, alpha=0.8)
    plt.axvline(x=wiki['length'][wiki['name'] == 'Barack Obama'][0], color='k', linestyle='--', linewidth=4,
               label='Length of Barack Obama', zorder=2)
    plt.axvline(x=wiki['length'][wiki['name'] == 'Joe Biden'][0], color='g', linestyle='--', linewidth=4,
               label='Length of Joe Biden', zorder=1)
    plt.axis([0, 1000, 0, 0.04])
    
    plt.legend(loc='best', prop={'size':15})
    plt.title('Distribution of document length')
    plt.xlabel('# of words')
    plt.ylabel('Percentage')
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
    
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-48-52ba4b128709> in <module>
          1 plt.figure(figsize=(10.5,4.5))
          2 plt.hist(wiki['length'], 50, color='k', edgecolor='None', histtype='stepfilled', normed=True,
    ----> 3          label='Entire Wikipedia', zorder=3, alpha=0.8)
          4 plt.hist(nearest_neighbors_euclidean['length'], 50, color='r', edgecolor='None', histtype='stepfilled', normed=True,
          5          label='100 NNs of Obama (Euclidean)', zorder=10, alpha=0.8)
    
    /opt/conda/lib/python3.7/site-packages/matplotlib/pyplot.py in hist(x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, data, **kwargs)
       2608         align=align, orientation=orientation, rwidth=rwidth, log=log,
       2609         color=color, label=label, stacked=stacked, **({"data": data}
    -> 2610         if data is not None else {}), **kwargs)
       2611 
       2612 
    
    /opt/conda/lib/python3.7/site-packages/matplotlib/__init__.py in inner(ax, data, *args, **kwargs)
       1563     def inner(ax, *args, data=None, **kwargs):
       1564         if data is None:
    -> 1565             return func(ax, *map(sanitize_sequence, args), **kwargs)
       1566 
       1567         bound = new_sig.bind(ax, *args, **kwargs)
    
    /opt/conda/lib/python3.7/site-packages/matplotlib/axes/_axes.py in hist(self, x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
       6806             if patch:
       6807                 p = patch[0]
    -> 6808                 p.update(kwargs)
       6809                 if lbl is not None:
       6810                     p.set_label(lbl)
    
    /opt/conda/lib/python3.7/site-packages/matplotlib/artist.py in update(self, props)
       1004 
       1005         with cbook._setattr_cm(self, eventson=False):
    -> 1006             ret = [_update_property(self, k, v) for k, v in props.items()]
       1007 
       1008         if len(ret):
    
    /opt/conda/lib/python3.7/site-packages/matplotlib/artist.py in <listcomp>(.0)
       1004 
       1005         with cbook._setattr_cm(self, eventson=False):
    -> 1006             ret = [_update_property(self, k, v) for k, v in props.items()]
       1007 
       1008         if len(ret):
    
    /opt/conda/lib/python3.7/site-packages/matplotlib/artist.py in _update_property(self, k, v)
       1000                 if not callable(func):
       1001                     raise AttributeError('{!r} object has no property {!r}'
    -> 1002                                          .format(type(self).__name__, k))
       1003                 return func(v)
       1004 
    
    AttributeError: 'Polygon' object has no property 'normed'
    In [ ]:
    model2_tf_idf = turicreate.nearest_neighbors.create(wiki, label='name', features=['tf_idf'],
                                                        method='brute_force', distance='cosine')
    
    In [ ]:
    nearest_neighbors_cosine = model2_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=100)
    nearest_neighbors_cosine = nearest_neighbors_cosine.join(wiki[['name', 'length']], on={'reference_label':'name'})
    
    In [ ]:
    nearest_neighbors_cosine.sort('rank')
    
    In [ ]:
    plt.figure(figsize=(10.5,4.5))
    plt.figure(figsize=(10.5,4.5))
    plt.hist(wiki['length'], 50, color='k', edgecolor='None', histtype='stepfilled', normed=True,
             label='Entire Wikipedia', zorder=3, alpha=0.8)
    plt.hist(nearest_neighbors_euclidean['length'], 50, color='r', edgecolor='None', histtype='stepfilled', normed=True,
             label='100 NNs of Obama (Euclidean)', zorder=10, alpha=0.8)
    plt.hist(nearest_neighbors_cosine['length'], 50, color='b', edgecolor='None', histtype='stepfilled', normed=True,
             label='100 NNs of Obama (cosine)', zorder=11, alpha=0.8)
    plt.axvline(x=wiki['length'][wiki['name'] == 'Barack Obama'][0], color='k', linestyle='--', linewidth=4,
               label='Length of Barack Obama', zorder=2)
    plt.axvline(x=wiki['length'][wiki['name'] == 'Joe Biden'][0], color='g', linestyle='--', linewidth=4,
               label='Length of Joe Biden', zorder=1)
    plt.axis([0, 1000, 0, 0.04])
    plt.legend(loc='best', prop={'size':15})
    plt.title('Distribution of document length')
    plt.xlabel('# of words')
    plt.ylabel('Percentage')
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()
    
    In [49]:
    sf = turicreate.SFrame({'text': ['democratic governments control law in response to popular act']})
    sf['word_count'] = turicreate.text_analytics.count_words(sf['text'])
    
    encoder = turicreate.toolkits._feature_engineering.TFIDF(features=['word_count'], output_column_prefix='tf_idf')
    encoder.fit(wiki)
    sf = encoder.transform(sf)
    sf
    
    Out[49]:
    text word_count tf_idf.word_count
    democratic governments
    control law in response ...
    {'act': 1.0, 'popular':
    1.0, 'in': 1.0, 'law': ...
    {'act':
    3.4597778278724887, ...
    [1 rows x 3 columns]
    In [50]:
    tweet_tf_idf = sf[0]['tf_idf.word_count']
    tweet_tf_idf
    
    Out[50]:
    {'act': 3.4597778278724887,
     'popular': 2.764478952022998,
     'in': 0.0009654063501214492,
     'law': 2.4538226269605703,
     'control': 3.721765211295327,
     'response': 4.261461747058352,
     'governments': 4.167571323949673,
     'to': 0.04694493768179923,
     'democratic': 3.1026721743330414}
    In [51]:
    obama = wiki[wiki['name'] == 'Barack Obama']
    obama
    
    Out[51]:
    URI name text word_count has_top_words
    <http://dbpedia.org/resou
    rce/Barack_Obama> ...
    Barack Obama barack hussein obama ii
    brk husen bm born august ...
    {'normalize': 1.0,
    'sought': 1.0, 'combat': ...
    1
    tf_idf length
    {'normalize':
    10.293348208665249, ...
    540
    [? rows x 7 columns]
    Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
    You can use sf.materialize() to force materialization.
    In [52]:
    obama_tf_idf = obama[0]['tf_idf']
    turicreate.toolkits.distances.cosine(obama_tf_idf, tweet_tf_idf)
    
    Out[52]:
    0.7059183777794329
    In [ ]:
    model2_tf_idf.query(obama, label='name', k=10)
    
    In [ ]:
     
    
    In [3]:
    from __future__ import print_function # to conform python 2.x print to python 3.x
    import numpy as np
    import turicreate
    from scipy.sparse import csr_matrix
    from sklearn.metrics.pairwise import pairwise_distances
    import time
    from copy import copy
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    '''compute norm of a sparse vector
       Thanks to: Jaiyam Sharma'''
    def norm(x):
        sum_sq=x.dot(x.T)
        norm=np.sqrt(sum_sq)
        return(norm)
    
    In [4]:
    wiki = turicreate.SFrame('people_wiki.sframe/')
    
    In [5]:
    wiki = wiki.add_row_number()
    
    In [6]:
    wiki['tf_idf'] = turicreate.text_analytics.tf_idf(wiki['text'])
    wiki.head()
    
    Out[6]:
    id URI name text tf_idf
    0 <http://dbpedia.org/resou
    rce/Digby_Morrell> ...
    Digby Morrell digby morrell born 10
    october 1979 is a former ...
    {'melbourne':
    3.8914310119380633, ...
    1 <http://dbpedia.org/resou
    rce/Alfred_J._Lewy> ...
    Alfred J. Lewy alfred j lewy aka sandy
    lewy graduated from ...
    {'time':
    1.3253342074200498, ...
    2 <http://dbpedia.org/resou
    rce/Harpdog_Brown> ...
    Harpdog Brown harpdog brown is a singer
    and harmonica player who ...
    {'society':
    2.4448047262085693, ...
    3 <http://dbpedia.org/resou
    rce/Franz_Rottensteiner> ...
    Franz Rottensteiner franz rottensteiner born
    in waidmannsfeld lower ...
    {'kurdlawitzpreis':
    10.986495389225194, ...
    4 <http://dbpedia.org/resou
    rce/G-Enka> ...
    G-Enka henry krvits born 30
    december 1974 in tallinn ...
    {'curtis':
    5.299520032885375, ...
    5 <http://dbpedia.org/resou
    rce/Sam_Henderson> ...
    Sam Henderson sam henderson born
    october 18 1969 is an ...
    {'asses':
    9.600201028105303, 's ...
    6 <http://dbpedia.org/resou
    rce/Aaron_LaCrate> ...
    Aaron LaCrate aaron lacrate is an
    american music producer ...
    {'streamz':
    10.986495389225194, ...
    7 <http://dbpedia.org/resou
    rce/Trevor_Ferguson> ...
    Trevor Ferguson trevor ferguson aka john
    farrow born 11 november ...
    {'concordia':
    6.250296940830698, ...
    8 <http://dbpedia.org/resou
    rce/Grant_Nelson> ...
    Grant Nelson grant nelson born 27
    april 1971 in london ...
    {'heavies':
    8.907053847545358, 'n ...
    9 <http://dbpedia.org/resou
    rce/Cathy_Caruth> ...
    Cathy Caruth cathy caruth born 1955 is
    frank h t rhodes ...
    {'2002':
    1.8753125887822302, ...
    [10 rows x 5 columns]
    In [7]:
    def sframe_to_scipy(x, column_name):
        '''
        Convert a dictionary column of an SFrame into a sparse matrix format where
        each (row_id, column_id, value) triple corresponds to the value of
        x[row_id][column_id], where column_id is a key in the dictionary.
           
        Example
        >>> sparse_matrix, map_key_to_index = sframe_to_scipy(sframe, column_name)
        '''
        assert type(x[column_name][0]) == dict, \
            'The chosen column must be dict type, representing sparse data.'
        
        # Stack will transform x to have a row for each unique (row, key) pair.
        x = x.stack(column_name, ['feature', 'value'])
        
        # Map feature words to integers 
        unique_words = sorted(x['feature'].unique())
        mapping = {word:i for i, word in enumerate(unique_words)}
        x['feature_id'] = x['feature'].apply(lambda x: mapping[x])
        
        # Create numpy arrays that contain the data for the sparse matrix.
        row_id = np.array(x['id'])
        col_id = np.array(x['feature_id'])
        data = np.array(x['value'])
        
        width = x['id'].max() + 1
        height = x['feature_id'].max() + 1
        
        # Create a sparse matrix.
        mat = csr_matrix((data, (row_id, col_id)), shape=(width, height))
        return mat, mapping
    
    In [8]:
    %%time
    corpus, mapping = sframe_to_scipy(wiki, 'tf_idf')
    
    Using default 16 lambda workers.
    To maximize the degree of parallelism, add the following code to the beginning of the program:
    "turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 32)"
    Note that increasing the degree of parallelism also increases the memory footprint.
    CPU times: user 3min 23s, sys: 11.7 s, total: 3min 34s
    Wall time: 3min 17s
    
    In [9]:
    assert corpus.shape == (59071, 547979)
    print('Check passed correctly!')
    
    Check passed correctly!
    
    In [10]:
    def generate_random_vectors(dim, n_vectors):
        return np.random.randn(dim, n_vectors)
    
    In [11]:
    # Generate 16 random vectors of dimension 547979
    np.random.seed(0)
    n_vectors = 16
    random_vectors = generate_random_vectors(corpus.shape[1], n_vectors)
    random_vectors.shape
    
    Out[11]:
    (547979, 16)
    In [12]:
    sample = corpus[0] # vector of tf-idf values for document 0
    bin_indices_bits = sample.dot(random_vectors[:,0]) >= 0
    bin_indices_bits
    
    Out[12]:
    array([ True])
    In [13]:
    sample.dot(random_vectors[:, 1]) >= 0 # True if positive sign; False if negative sign
    
    Out[13]:
    array([False])
    In [14]:
    sample.dot(random_vectors) >= 0 # should return an array of 16 True/False bits
    
    Out[14]:
    array([[ True, False, False, False,  True, False,  True, False,  True,
             True,  True, False,  True,  True, False,  True]])
    In [15]:
    np.array(sample.dot(random_vectors) >= 0, dtype=int) # display index bits in 0/1's
    
    Out[15]:
    array([[1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]])
    In [16]:
    corpus[0:2].dot(random_vectors) >= 0 # compute bit indices of first two documents
    
    Out[16]:
    array([[ True, False, False, False,  True, False,  True, False,  True,
             True,  True, False,  True,  True, False,  True],
           [False, False, False, False,  True, False, False, False,  True,
             True,  True, False, False,  True, False,  True]])
    In [17]:
    corpus.dot(random_vectors) >= 0 # compute bit indices of ALL documents
    
    Out[17]:
    array([[ True, False, False, ...,  True, False,  True],
           [False, False, False, ...,  True, False,  True],
           [ True,  True,  True, ...,  True,  True,  True],
           ...,
           [False, False, False, ..., False, False,  True],
           [ True,  True, False, ...,  True,  True, False],
           [ True,  True,  True, ...,  True, False,  True]])
    In [18]:
    index_bits = (sample.dot(random_vectors) >= 0)
    powers_of_two = (1 << np.arange(15, -1, -1))
    print(index_bits)
    print(powers_of_two)
    print(index_bits.dot(powers_of_two))
    
    [[ True False False False  True False  True False  True  True  True False
       True  True False  True]]
    [32768 16384  8192  4096  2048  1024   512   256   128    64    32    16
         8     4     2     1]
    [35565]
    
    In [19]:
    index_bits = sample.dot(random_vectors) >= 0
    index_bits.dot(powers_of_two)
    
    Out[19]:
    array([35565])
    In [20]:
    from collections import defaultdict 
    
    def train_lsh(data, n_vectors, seed=None):    
        if seed is not None:
            np.random.seed(seed)
    
        dim = data.shape[1]
        random_vectors = generate_random_vectors(dim, n_vectors)  
    
        # Partition data points into bins,
        # and encode bin index bits into integers
        bin_indices_bits = data.dot(random_vectors) >= 0
        powers_of_two = 1 << np.arange(n_vectors - 1, -1, step=-1)
        bin_indices = bin_indices_bits.dot(powers_of_two)
    
        # Update `table` so that `table[i]` is the list of document ids with bin index equal to i
        table = defaultdict(list)
        for idx, bin_index in enumerate(bin_indices):
            # Fetch the list of document ids associated with the bin and add the document id to the end.
            # data_index: document ids
            # append() will add a list of document ids to table dict() with key as bin_index
            table[bin_index].append(idx)   # YOUR CODE HERE
        
        # Note that we're storing the bin_indices here
        # so we can do some ad-hoc checking with it,
        # this isn't actually required
        model = {'data': data,
                 'table': table,
                 'random_vectors': random_vectors,
                 'bin_indices': bin_indices,
                 'bin_indices_bits': bin_indices_bits}
        return model
    
    In [21]:
    def compare_bits(model, id_1, id_2):
        bits1 = model['bin_indices_bits'][id_1]
        bits2 = model['bin_indices_bits'][id_2]
        print('Number of agreed bits: ', np.sum(bits1 == bits2))
        return np.sum(bits1 == bits2)
    
    In [22]:
    model = train_lsh(corpus, 16, seed=475)
    obama_id = wiki[wiki['name'] == 'Barack Obama']['id'][0]
    biden_id = wiki[wiki['name'] == 'Joe Biden']['id'][0]
    similariy = compare_bits(model, obama_id, biden_id)
    
    Number of agreed bits:  15
    
    In [23]:
    # This function will help us get similar items, given the id
    def get_similarity_items(X_tfidf, item_id, topn=5):
        """
        Get the top similar items for a given item id.
        The similarity measure here is based on cosine distance.
        """
        query = X_tfidf[item_id]
        scores = X_tfidf.dot(query.T).toarray().ravel()
        best = np.argpartition(scores, -topn)[-topn:]
        similar_items = sorted(zip(best, scores[best]), key=lambda x: -x[1])
        similar_item_ids = [similar_item for similar_item, _ in similar_items]
        print("Similar items to id: {}".format(item_id))
        for _id in similar_item_ids:
            print(wiki[_id]['name'])
        print('\n')
        return similar_item_ids
    
    In [24]:
    wiki[wiki['name'] == 'Barack Obama']
    
    Out[24]:
    id URI name text tf_idf
    35817 <http://dbpedia.org/resou
    rce/Barack_Obama> ...
    Barack Obama barack hussein obama ii
    brk husen bm born august ...
    {'normalize':
    10.293348208665249, ...
    [? rows x 5 columns]
    Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
    You can use sf.materialize() to force materialization.
    In [25]:
    obama_id
    
    Out[25]:
    35817
    In [26]:
    s = ''.join(map(str,model['bin_indices_bits'][obama_id].astype(int)))
    sum(int(c) * (2 ** i) for i, c in enumerate(s[::-1]))
    
    Out[26]:
    38448
    In [27]:
    #s = '1110'
    #sum(int(c) * (2 ** i) for i, c in enumerate(s[::-1]))
    
    In [28]:
    wiki[wiki['name'] == 'Joe Biden']
    
    Out[28]:
    id URI name text tf_idf
    24478 <http://dbpedia.org/resou
    rce/Joe_Biden> ...
    Joe Biden joseph robinette joe
    biden jr dosf rbnt badn ...
    {'were':
    1.521978023354629, ...
    [? rows x 5 columns]
    Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
    You can use sf.materialize() to force materialization.
    In [29]:
    so = ''.join(map(str,model['bin_indices_bits'][obama_id].astype(int)))
    sb = ''.join(map(str,model['bin_indices_bits'][biden_id].astype(int)))
    sum([so[i]==sb[i] for i in range(len(so))])
    
    Out[29]:
    15
    In [30]:
    jones_id = wiki[wiki['name']=='Wynn Normington Hugh-Jones']['id'][0]
    compare_bits(model, obama_id, jones_id)
    
    Number of agreed bits:  8
    
    Out[30]:
    8
    In [31]:
    model['bin_indices'][obama_id]
    
    Out[31]:
    38448
    In [32]:
    model['table'][model['bin_indices'][obama_id]]
    
    Out[32]:
    [35817, 54743]
    In [33]:
    doc_ids = list(model['table'][model['bin_indices'][35817]])
    doc_ids.remove(35817) # display documents other than Obama
    
    docs = wiki.filter_by(values=doc_ids, column_name='id') # filter by id column
    docs
    
    Out[33]:
    id URI name text tf_idf
    54743 <http://dbpedia.org/resou
    rce/Radovan_%C5%BDerjav> ...
    Radovan %C5%BDerjav radovan erjav born 2
    december 1968 is a ...
    {'hungarian':
    5.299520032885375, ...
    [1 rows x 5 columns]
    In [34]:
    res = compare_bits(model, obama_id, docs[0]['id']), compare_bits(model, obama_id, biden_id)
    
    Number of agreed bits:  16
    Number of agreed bits:  15
    
    In [35]:
    from itertools import combinations
    
    In [36]:
    num_vector = 16
    search_radius = 3
    
    for diff in combinations(range(num_vector), search_radius):
        print(diff)
    
    (0, 1, 2)
    (0, 1, 3)
    (0, 1, 4)
    (0, 1, 5)
    (0, 1, 6)
    (0, 1, 7)
    (0, 1, 8)
    (0, 1, 9)
    (0, 1, 10)
    (0, 1, 11)
    (0, 1, 12)
    (0, 1, 13)
    (0, 1, 14)
    (0, 1, 15)
    (0, 2, 3)
    (0, 2, 4)
    (0, 2, 5)
    (0, 2, 6)
    (0, 2, 7)
    (0, 2, 8)
    (0, 2, 9)
    (0, 2, 10)
    (0, 2, 11)
    (0, 2, 12)
    (0, 2, 13)
    (0, 2, 14)
    (0, 2, 15)
    (0, 3, 4)
    (0, 3, 5)
    (0, 3, 6)
    (0, 3, 7)
    (0, 3, 8)
    (0, 3, 9)
    (0, 3, 10)
    (0, 3, 11)
    (0, 3, 12)
    (0, 3, 13)
    (0, 3, 14)
    (0, 3, 15)
    (0, 4, 5)
    (0, 4, 6)
    (0, 4, 7)
    (0, 4, 8)
    (0, 4, 9)
    (0, 4, 10)
    (0, 4, 11)
    (0, 4, 12)
    (0, 4, 13)
    (0, 4, 14)
    (0, 4, 15)
    (0, 5, 6)
    (0, 5, 7)
    (0, 5, 8)
    (0, 5, 9)
    (0, 5, 10)
    (0, 5, 11)
    (0, 5, 12)
    (0, 5, 13)
    (0, 5, 14)
    (0, 5, 15)
    (0, 6, 7)
    (0, 6, 8)
    (0, 6, 9)
    (0, 6, 10)
    (0, 6, 11)
    (0, 6, 12)
    (0, 6, 13)
    (0, 6, 14)
    (0, 6, 15)
    (0, 7, 8)
    (0, 7, 9)
    (0, 7, 10)
    (0, 7, 11)
    (0, 7, 12)
    (0, 7, 13)
    (0, 7, 14)
    (0, 7, 15)
    (0, 8, 9)
    (0, 8, 10)
    (0, 8, 11)
    (0, 8, 12)
    (0, 8, 13)
    (0, 8, 14)
    (0, 8, 15)
    (0, 9, 10)
    (0, 9, 11)
    (0, 9, 12)
    (0, 9, 13)
    (0, 9, 14)
    (0, 9, 15)
    (0, 10, 11)
    (0, 10, 12)
    (0, 10, 13)
    (0, 10, 14)
    (0, 10, 15)
    (0, 11, 12)
    (0, 11, 13)
    (0, 11, 14)
    (0, 11, 15)
    (0, 12, 13)
    (0, 12, 14)
    (0, 12, 15)
    (0, 13, 14)
    (0, 13, 15)
    (0, 14, 15)
    (1, 2, 3)
    (1, 2, 4)
    (1, 2, 5)
    (1, 2, 6)
    (1, 2, 7)
    (1, 2, 8)
    (1, 2, 9)
    (1, 2, 10)
    (1, 2, 11)
    (1, 2, 12)
    (1, 2, 13)
    (1, 2, 14)
    (1, 2, 15)
    (1, 3, 4)
    (1, 3, 5)
    (1, 3, 6)
    (1, 3, 7)
    (1, 3, 8)
    (1, 3, 9)
    (1, 3, 10)
    (1, 3, 11)
    (1, 3, 12)
    (1, 3, 13)
    (1, 3, 14)
    (1, 3, 15)
    (1, 4, 5)
    (1, 4, 6)
    (1, 4, 7)
    (1, 4, 8)
    (1, 4, 9)
    (1, 4, 10)
    (1, 4, 11)
    (1, 4, 12)
    (1, 4, 13)
    (1, 4, 14)
    (1, 4, 15)
    (1, 5, 6)
    (1, 5, 7)
    (1, 5, 8)
    (1, 5, 9)
    (1, 5, 10)
    (1, 5, 11)
    (1, 5, 12)
    (1, 5, 13)
    (1, 5, 14)
    (1, 5, 15)
    (1, 6, 7)
    (1, 6, 8)
    (1, 6, 9)
    (1, 6, 10)
    (1, 6, 11)
    (1, 6, 12)
    (1, 6, 13)
    (1, 6, 14)
    (1, 6, 15)
    (1, 7, 8)
    (1, 7, 9)
    (1, 7, 10)
    (1, 7, 11)
    (1, 7, 12)
    (1, 7, 13)
    (1, 7, 14)
    (1, 7, 15)
    (1, 8, 9)
    (1, 8, 10)
    (1, 8, 11)
    (1, 8, 12)
    (1, 8, 13)
    (1, 8, 14)
    (1, 8, 15)
    (1, 9, 10)
    (1, 9, 11)
    (1, 9, 12)
    (1, 9, 13)
    (1, 9, 14)
    (1, 9, 15)
    (1, 10, 11)
    (1, 10, 12)
    (1, 10, 13)
    (1, 10, 14)
    (1, 10, 15)
    (1, 11, 12)
    (1, 11, 13)
    (1, 11, 14)
    (1, 11, 15)
    (1, 12, 13)
    (1, 12, 14)
    (1, 12, 15)
    (1, 13, 14)
    (1, 13, 15)
    (1, 14, 15)
    (2, 3, 4)
    (2, 3, 5)
    (2, 3, 6)
    (2, 3, 7)
    (2, 3, 8)
    (2, 3, 9)
    (2, 3, 10)
    (2, 3, 11)
    (2, 3, 12)
    (2, 3, 13)
    (2, 3, 14)
    (2, 3, 15)
    (2, 4, 5)
    (2, 4, 6)
    (2, 4, 7)
    (2, 4, 8)
    (2, 4, 9)
    (2, 4, 10)
    (2, 4, 11)
    (2, 4, 12)
    (2, 4, 13)
    (2, 4, 14)
    (2, 4, 15)
    (2, 5, 6)
    (2, 5, 7)
    (2, 5, 8)
    (2, 5, 9)
    (2, 5, 10)
    (2, 5, 11)
    (2, 5, 12)
    (2, 5, 13)
    (2, 5, 14)
    (2, 5, 15)
    (2, 6, 7)
    (2, 6, 8)
    (2, 6, 9)
    (2, 6, 10)
    (2, 6, 11)
    (2, 6, 12)
    (2, 6, 13)
    (2, 6, 14)
    (2, 6, 15)
    (2, 7, 8)
    (2, 7, 9)
    (2, 7, 10)
    (2, 7, 11)
    (2, 7, 12)
    (2, 7, 13)
    (2, 7, 14)
    (2, 7, 15)
    (2, 8, 9)
    (2, 8, 10)
    (2, 8, 11)
    (2, 8, 12)
    (2, 8, 13)
    (2, 8, 14)
    (2, 8, 15)
    (2, 9, 10)
    (2, 9, 11)
    (2, 9, 12)
    (2, 9, 13)
    (2, 9, 14)
    (2, 9, 15)
    (2, 10, 11)
    (2, 10, 12)
    (2, 10, 13)
    (2, 10, 14)
    (2, 10, 15)
    (2, 11, 12)
    (2, 11, 13)
    (2, 11, 14)
    (2, 11, 15)
    (2, 12, 13)
    (2, 12, 14)
    (2, 12, 15)
    (2, 13, 14)
    (2, 13, 15)
    (2, 14, 15)
    (3, 4, 5)
    (3, 4, 6)
    (3, 4, 7)
    (3, 4, 8)
    (3, 4, 9)
    (3, 4, 10)
    (3, 4, 11)
    (3, 4, 12)
    (3, 4, 13)
    (3, 4, 14)
    (3, 4, 15)
    (3, 5, 6)
    (3, 5, 7)
    (3, 5, 8)
    (3, 5, 9)
    (3, 5, 10)
    (3, 5, 11)
    (3, 5, 12)
    (3, 5, 13)
    (3, 5, 14)
    (3, 5, 15)
    (3, 6, 7)
    (3, 6, 8)
    (3, 6, 9)
    (3, 6, 10)
    (3, 6, 11)
    (3, 6, 12)
    (3, 6, 13)
    (3, 6, 14)
    (3, 6, 15)
    (3, 7, 8)
    (3, 7, 9)
    (3, 7, 10)
    (3, 7, 11)
    (3, 7, 12)
    (3, 7, 13)
    (3, 7, 14)
    (3, 7, 15)
    (3, 8, 9)
    (3, 8, 10)
    (3, 8, 11)
    (3, 8, 12)
    (3, 8, 13)
    (3, 8, 14)
    (3, 8, 15)
    (3, 9, 10)
    (3, 9, 11)
    (3, 9, 12)
    (3, 9, 13)
    (3, 9, 14)
    (3, 9, 15)
    (3, 10, 11)
    (3, 10, 12)
    (3, 10, 13)
    (3, 10, 14)
    (3, 10, 15)
    (3, 11, 12)
    (3, 11, 13)
    (3, 11, 14)
    (3, 11, 15)
    (3, 12, 13)
    (3, 12, 14)
    (3, 12, 15)
    (3, 13, 14)
    (3, 13, 15)
    (3, 14, 15)
    (4, 5, 6)
    (4, 5, 7)
    (4, 5, 8)
    (4, 5, 9)
    (4, 5, 10)
    (4, 5, 11)
    (4, 5, 12)
    (4, 5, 13)
    (4, 5, 14)
    (4, 5, 15)
    (4, 6, 7)
    (4, 6, 8)
    (4, 6, 9)
    (4, 6, 10)
    (4, 6, 11)
    (4, 6, 12)
    (4, 6, 13)
    (4, 6, 14)
    (4, 6, 15)
    (4, 7, 8)
    (4, 7, 9)
    (4, 7, 10)
    (4, 7, 11)
    (4, 7, 12)
    (4, 7, 13)
    (4, 7, 14)
    (4, 7, 15)
    (4, 8, 9)
    (4, 8, 10)
    (4, 8, 11)
    (4, 8, 12)
    (4, 8, 13)
    (4, 8, 14)
    (4, 8, 15)
    (4, 9, 10)
    (4, 9, 11)
    (4, 9, 12)
    (4, 9, 13)
    (4, 9, 14)
    (4, 9, 15)
    (4, 10, 11)
    (4, 10, 12)
    (4, 10, 13)
    (4, 10, 14)
    (4, 10, 15)
    (4, 11, 12)
    (4, 11, 13)
    (4, 11, 14)
    (4, 11, 15)
    (4, 12, 13)
    (4, 12, 14)
    (4, 12, 15)
    (4, 13, 14)
    (4, 13, 15)
    (4, 14, 15)
    (5, 6, 7)
    (5, 6, 8)
    (5, 6, 9)
    (5, 6, 10)
    (5, 6, 11)
    (5, 6, 12)
    (5, 6, 13)
    (5, 6, 14)
    (5, 6, 15)
    (5, 7, 8)
    (5, 7, 9)
    (5, 7, 10)
    (5, 7, 11)
    (5, 7, 12)
    (5, 7, 13)
    (5, 7, 14)
    (5, 7, 15)
    (5, 8, 9)
    (5, 8, 10)
    (5, 8, 11)
    (5, 8, 12)
    (5, 8, 13)
    (5, 8, 14)
    (5, 8, 15)
    (5, 9, 10)
    (5, 9, 11)
    (5, 9, 12)
    (5, 9, 13)
    (5, 9, 14)
    (5, 9, 15)
    (5, 10, 11)
    (5, 10, 12)
    (5, 10, 13)
    (5, 10, 14)
    (5, 10, 15)
    (5, 11, 12)
    (5, 11, 13)
    (5, 11, 14)
    (5, 11, 15)
    (5, 12, 13)
    (5, 12, 14)
    (5, 12, 15)
    (5, 13, 14)
    (5, 13, 15)
    (5, 14, 15)
    (6, 7, 8)
    (6, 7, 9)
    (6, 7, 10)
    (6, 7, 11)
    (6, 7, 12)
    (6, 7, 13)
    (6, 7, 14)
    (6, 7, 15)
    (6, 8, 9)
    (6, 8, 10)
    (6, 8, 11)
    (6, 8, 12)
    (6, 8, 13)
    (6, 8, 14)
    (6, 8, 15)
    (6, 9, 10)
    (6, 9, 11)
    (6, 9, 12)
    (6, 9, 13)
    (6, 9, 14)
    (6, 9, 15)
    (6, 10, 11)
    (6, 10, 12)
    (6, 10, 13)
    (6, 10, 14)
    (6, 10, 15)
    (6, 11, 12)
    (6, 11, 13)
    (6, 11, 14)
    (6, 11, 15)
    (6, 12, 13)
    (6, 12, 14)
    (6, 12, 15)
    (6, 13, 14)
    (6, 13, 15)
    (6, 14, 15)
    (7, 8, 9)
    (7, 8, 10)
    (7, 8, 11)
    (7, 8, 12)
    (7, 8, 13)
    (7, 8, 14)
    (7, 8, 15)
    (7, 9, 10)
    (7, 9, 11)
    (7, 9, 12)
    (7, 9, 13)
    (7, 9, 14)
    (7, 9, 15)
    (7, 10, 11)
    (7, 10, 12)
    (7, 10, 13)
    (7, 10, 14)
    (7, 10, 15)
    (7, 11, 12)
    (7, 11, 13)
    (7, 11, 14)
    (7, 11, 15)
    (7, 12, 13)
    (7, 12, 14)
    (7, 12, 15)
    (7, 13, 14)
    (7, 13, 15)
    (7, 14, 15)
    (8, 9, 10)
    (8, 9, 11)
    (8, 9, 12)
    (8, 9, 13)
    (8, 9, 14)
    (8, 9, 15)
    (8, 10, 11)
    (8, 10, 12)
    (8, 10, 13)
    (8, 10, 14)
    (8, 10, 15)
    (8, 11, 12)
    (8, 11, 13)
    (8, 11, 14)
    (8, 11, 15)
    (8, 12, 13)
    (8, 12, 14)
    (8, 12, 15)
    (8, 13, 14)
    (8, 13, 15)
    (8, 14, 15)
    (9, 10, 11)
    (9, 10, 12)
    (9, 10, 13)
    (9, 10, 14)
    (9, 10, 15)
    (9, 11, 12)
    (9, 11, 13)
    (9, 11, 14)
    (9, 11, 15)
    (9, 12, 13)
    (9, 12, 14)
    (9, 12, 15)
    (9, 13, 14)
    (9, 13, 15)
    (9, 14, 15)
    (10, 11, 12)
    (10, 11, 13)
    (10, 11, 14)
    (10, 11, 15)
    (10, 12, 13)
    (10, 12, 14)
    (10, 12, 15)
    (10, 13, 14)
    (10, 13, 15)
    (10, 14, 15)
    (11, 12, 13)
    (11, 12, 14)
    (11, 12, 15)
    (11, 13, 14)
    (11, 13, 15)
    (11, 14, 15)
    (12, 13, 14)
    (12, 13, 15)
    (12, 14, 15)
    (13, 14, 15)
    
    In [37]:
    def search_nearby_bins(query_bin_bits, table, search_radius=2, initial_candidates=set()):
        """
        For a given query vector and trained LSH model, return all candidate neighbors for
        the query among all bins within the given search radius.
        
        Example usage
        -------------
        >>> model = train_lsh(corpus, num_vector=16, seed=143)
        >>> q = model['bin_index_bits'][0]  # vector for the first document
      
        >>> candidates = search_nearby_bins(q, model['table'])
        """
        num_vector = len(query_bin_bits)
        powers_of_two = 1 << np.arange(num_vector-1, -1, -1)
        
        # Allow the user to provide an initial set of candidates.
        candidate_set = copy(initial_candidates)
        
        for different_bits in combinations(range(num_vector), search_radius):       
            # Flip the bits (n_1,n_2,...,n_r) of the query bin to produce a new bit vector.
            ## Hint: you can iterate over a tuple like a list
            alternate_bits = copy(query_bin_bits)
            for i in different_bits:
                alternate_bits[i] = 1 - alternate_bits[i] # YOUR CODE HERE 
            
            # Convert the new bit vector to an integer index
            nearby_bin = alternate_bits.dot(powers_of_two)
            
            # Fetch the list of documents belonging to the bin indexed by the new bit vector.
            # Then add those documents to candidate_set
            # Make sure that the bin exists in the table!
            # Hint: update() method for sets lets you add an entire list to the set
            if nearby_bin in table:
                more_docs = table[nearby_bin] # Get all document_ids of the bin
                ... # YOUR CODE HERE: Update candidate_set with the documents in this bin.
                candidate_set.update(more_docs)
                
        return candidate_set
    
    In [38]:
    obama_bin_index = model['bin_indices_bits'][35817] # bin index of Barack Obama
    candidate_set = search_nearby_bins(obama_bin_index, model['table'], search_radius=0)
    if candidate_set == set({35817, 54743}):
        print('Passed test')
    else:
        print('Check your code')
    print('List of documents in the same bin as Obama: {}'.format(candidate_set))
    
    Passed test
    List of documents in the same bin as Obama: {35817, 54743}
    
    In [39]:
    candidate_set = search_nearby_bins(obama_bin_index, model['table'], search_radius=1, initial_candidates=candidate_set)
    if candidate_set == set({42243, 28804, 1810, 48919, 24478, 31010, 7331, 23716, 51108, 48040, 36266, 33200, 25023, 23617, 54743, 34910, 35817, 34159, 14451, 23926, 39032, 12028, 43775}):
        print('Passed test')
    else:
        print('Check your code')
    print(candidate_set)
    
    Passed test
    {42243, 28804, 1810, 48919, 24478, 31010, 7331, 23716, 51108, 48040, 36266, 33200, 25023, 23617, 54743, 34910, 35817, 34159, 14451, 23926, 39032, 12028, 43775}
    
    In [40]:
    def query(vec, model, k, max_search_radius):
      
        data = model['data']
        table = model['table']
        random_vectors = model['random_vectors']
        num_vector = random_vectors.shape[1]
        
        
        # Compute bin index for the query vector, in bit representation.
        bin_index_bits = (vec.dot(random_vectors) >= 0).flatten()
        
        # Search nearby bins and collect candidates
        candidate_set = set()
        for search_radius in range(max_search_radius+1):
            candidate_set = search_nearby_bins(bin_index_bits, table, search_radius, initial_candidates=candidate_set)
        
        # Sort candidates by their true distances from the query
        nearest_neighbors = turicreate.SFrame({'id':candidate_set})
        candidates = data[np.array(list(candidate_set)),:]
        nearest_neighbors['distance'] = pairwise_distances(candidates, vec, metric='cosine').flatten()
        
        return nearest_neighbors.topk('distance', k, reverse=True), len(candidate_set)
    
    In [41]:
    query(corpus[35817,:], model, k=10, max_search_radius=3)
    
    Out[41]:
    (Columns:
     	id	int
     	distance	float
     
     Rows: 10
     
     Data:
     +-------+------------------------+
     |   id  |        distance        |
     +-------+------------------------+
     | 35817 | 1.1102230246251565e-16 |
     | 24478 |   0.703138676733575    |
     | 38376 |   0.7429819023278823   |
     |  4032 |   0.8145547486714284   |
     | 43155 |   0.8408390074837325   |
     | 20159 |   0.844036884280093    |
     | 11517 |   0.8483420107162964   |
     | 46332 |   0.8897020225435585   |
     | 22063 |   0.8946710479694914   |
     | 10437 |   0.9001571479475322   |
     +-------+------------------------+
     [10 rows x 2 columns],
     771)
    In [42]:
    query(corpus[35817,:], model, k=10, max_search_radius=3)[0].join(wiki[['id', 'name']], on='id').sort('distance')
    
    Out[42]:
    id distance name
    35817 1.1102230246251565e-16 Barack Obama
    24478 0.703138676733575 Joe Biden
    38376 0.7429819023278823 Samantha Power
    4032 0.8145547486714284 Kenneth D. Thompson
    43155 0.8408390074837325 Goodwin Liu
    20159 0.844036884280093 Charlie Crist
    11517 0.8483420107162964 Louis Susman
    46332 0.8897020225435585 Tom Tancredo
    22063 0.8946710479694914 Kathryn Troutman
    10437 0.9001571479475322 David J. Hayes
    [10 rows x 3 columns]
    In [43]:
    wiki[wiki['name']=='Barack Obama']
    
    Out[43]:
    id URI name text tf_idf
    35817 <http://dbpedia.org/resou
    rce/Barack_Obama> ...
    Barack Obama barack hussein obama ii
    brk husen bm born august ...
    {'normalize':
    10.293348208665249, ...
    [? rows x 5 columns]
    Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
    You can use sf.materialize() to force materialization.
    In [45]:
    %%time
    num_candidates_history = []
    query_time_history = []
    max_distance_from_query_history = []
    min_distance_from_query_history = []
    average_distance_from_query_history = []
    
    for max_search_radius in range(17):
        start=time.time()
        result, num_candidates = query(corpus[35817,:], model, k=10,
                                       max_search_radius=max_search_radius)
        end=time.time()
        query_time = end-start
        
        print('Radius:', max_search_radius)
        print(result.join(wiki[['id', 'name']], on='id').sort('distance'))
        #res = list(result.join(wiki[['id', 'name']], on='id')['distance'])
        #print(res)
        #print(np.mean(res))
    
        average_distance_from_query = result['distance'][1:].mean()
        max_distance_from_query = result['distance'][1:].max()
        min_distance_from_query = result['distance'][1:].min()
        print(average_distance_from_query)
        
        num_candidates_history.append(num_candidates)
        query_time_history.append(query_time)
        average_distance_from_query_history.append(average_distance_from_query)
        max_distance_from_query_history.append(max_distance_from_query)
        min_distance_from_query_history.append(min_distance_from_query)
    
    Radius: 0
    +-------+------------------------+---------------------+
    |   id  |        distance        |         name        |
    +-------+------------------------+---------------------+
    | 35817 | 1.1102230246251565e-16 |     Barack Obama    |
    | 54743 |    0.97334561058472    | Radovan %C5%BDerjav |
    +-------+------------------------+---------------------+
    [2 rows x 3 columns]
    
    0.97334561058472
    Radius: 1
    +-------+------------------------+------------------------------+
    |   id  |        distance        |             name             |
    +-------+------------------------+------------------------------+
    | 35817 | 1.1102230246251565e-16 |         Barack Obama         |
    | 24478 |   0.703138676733575    |          Joe Biden           |
    | 34159 |   0.9430865736846581   |       Jennifer Hudson        |
    | 23926 |   0.9608039657958866   | Se%C3%A1n Power (politician) |
    | 36266 |   0.9615994281067699   |         Ralph Weber          |
    | 33200 |    0.97101213334657    |         Emrah Yucel          |
    | 28804 |   0.9729856623983864   |     Matthew McConaughey      |
    | 54743 |    0.97334561058472    |     Radovan %C5%BDerjav      |
    |  7331 |   0.9735542076945761   |       Joselo D%C3%ADaz       |
    | 43775 |   0.9790856334914729   |       Carly Rae Jepsen       |
    +-------+------------------------+------------------------------+
    [10 rows x 3 columns]
    
    0.9376235435374016
    Radius: 2
    +-------+------------------------+--------------------------------+
    |   id  |        distance        |              name              |
    +-------+------------------------+--------------------------------+
    | 35817 | 1.1102230246251565e-16 |          Barack Obama          |
    | 24478 |   0.703138676733575    |           Joe Biden            |
    |  9051 |   0.9008406076426497   |         Newt Gingrich          |
    | 46253 |   0.9158128432084635   |        Francisco Rezek         |
    |  110  |   0.9346379007684388   |      Abdel Fattah el-Sisi      |
    | 40837 |   0.9370458058764901   |    Dovey Johnson Roundtree     |
    | 34159 |   0.9430865736846581   |        Jennifer Hudson         |
    | 28320 |   0.9460512776474829   |          Robert Reich          |
    | 33070 |   0.9466276757770745   |          Claude Allen          |
    |  3818 |   0.9484009651426074   | Tom Sawyer (Kansas politician) |
    +-------+------------------------+--------------------------------+
    [10 rows x 3 columns]
    
    0.9084047029423821
    Radius: 3
    +-------+------------------------+---------------------+
    |   id  |        distance        |         name        |
    +-------+------------------------+---------------------+
    | 35817 | 1.1102230246251565e-16 |     Barack Obama    |
    | 24478 |   0.703138676733575    |      Joe Biden      |
    | 38376 |   0.7429819023278823   |    Samantha Power   |
    |  4032 |   0.8145547486714284   | Kenneth D. Thompson |
    | 43155 |   0.8408390074837325   |     Goodwin Liu     |
    | 20159 |   0.844036884280093    |    Charlie Crist    |
    | 11517 |   0.8483420107162964   |     Louis Susman    |
    | 46332 |   0.8897020225435585   |     Tom Tancredo    |
    | 22063 |   0.8946710479694914   |   Kathryn Troutman  |
    | 10437 |   0.9001571479475322   |    David J. Hayes   |
    +-------+------------------------+---------------------+
    [10 rows x 3 columns]
    
    0.8309359387415098
    Radius: 4
    +-------+------------------------+---------------------+
    |   id  |        distance        |         name        |
    +-------+------------------------+---------------------+
    | 35817 | 1.1102230246251565e-16 |     Barack Obama    |
    | 24478 |   0.703138676733575    |      Joe Biden      |
    | 38376 |   0.7429819023278823   |    Samantha Power   |
    | 23737 |   0.8101646334648858   |  John D. McCormick  |
    |  4032 |   0.8145547486714284   | Kenneth D. Thompson |
    | 14754 |   0.826854025896727    |     Mitt Romney     |
    | 43155 |   0.8408390074837325   |     Goodwin Liu     |
    | 20159 |   0.844036884280093    |    Charlie Crist    |
    | 11517 |   0.8483420107162964   |     Louis Susman    |
    | 40184 |   0.8601570123329991   |     Chuck Hagel     |
    +-------+------------------------+---------------------+
    [10 rows x 3 columns]
    
    0.8101187668786243
    Radius: 5
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    | 23737 |   0.8101646334648858   |    John D. McCormick    |
    |  4032 |   0.8145547486714284   |   Kenneth D. Thompson   |
    | 14754 |   0.826854025896727    |       Mitt Romney       |
    | 24848 |   0.8394067356676752   |     John C. Eastman     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7886727473079377
    Radius: 6
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    | 23737 |   0.8101646334648858   |    John D. McCormick    |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7735510321399474
    Radius: 7
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 8
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 9
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 10
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 11
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 12
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 13
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 14
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 15
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    Radius: 16
    +-------+------------------------+-------------------------+
    |   id  |        distance        |           name          |
    +-------+------------------------+-------------------------+
    | 35817 | 1.1102230246251565e-16 |       Barack Obama      |
    | 24478 |   0.703138676733575    |        Joe Biden        |
    | 38376 |   0.7429819023278823   |      Samantha Power     |
    | 57108 |   0.7583583978869675   |  Hillary Rodham Clinton |
    | 38714 |   0.7705612276009974   | Eric Stern (politician) |
    | 46140 |    0.78467750475065    |       Robert Gibbs      |
    |  6796 |   0.7880390729434776   |       Eric Holder       |
    | 44681 |   0.790926415366316    |  Jesse Lee (politician) |
    | 18827 |   0.7983226028934733   |       Henry Waxman      |
    |  2412 |   0.799466360041952    |     Joe the Plumber     |
    +-------+------------------------+-------------------------+
    [10 rows x 3 columns]
    
    0.7707191289494768
    CPU times: user 15.2 s, sys: 1.51 s, total: 16.7 s
    Wall time: 15.4 s
    
    In [99]:
    plt.figure(figsize=(7,4.5))
    plt.plot(num_candidates_history, linewidth=4)
    plt.xlabel('Search radius')
    plt.ylabel('# of documents searched')
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
    
    plt.figure(figsize=(7,4.5))
    plt.plot(query_time_history, linewidth=4)
    plt.xlabel('Search radius')
    plt.ylabel('Query time (seconds)')
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
    
    plt.figure(figsize=(7,4.5))
    plt.plot(average_distance_from_query_history, linewidth=4, label='Average of 10 neighbors')
    plt.plot(max_distance_from_query_history, linewidth=4, label='Farthest of 10 neighbors')
    plt.plot(min_distance_from_query_history, linewidth=4, label='Closest of 10 neighbors')
    plt.xlabel('Search radius')
    plt.ylabel('Cosine distance of neighbors')
    plt.legend(loc='best', prop={'size':15})
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
    
    In [ ]:
     
    
    In [2]:
    from __future__ import print_function # to conform python 2.x print to python 3.x
    import turicreate
    import matplotlib.pyplot as plt
    import numpy as np
    import sys
    import os
    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    
    %matplotlib inline
    
    In [3]:
    wiki = turicreate.SFrame('people_wiki.sframe/')
    
    In [4]:
    wiki['tf_idf'] = turicreate.text_analytics.tf_idf(wiki['text'])
    
    In [5]:
    def sframe_to_scipy(x, column_name):
        '''
        Convert a dictionary column of an SFrame into a sparse matrix format where
        each (row_id, column_id, value) triple corresponds to the value of
        x[row_id][column_id], where column_id is a key in the dictionary.
           
        Example
        >>> sparse_matrix, map_key_to_index = sframe_to_scipy(sframe, column_name)
        '''
        assert type(x[column_name][0]) == dict, \
            'The chosen column must be dict type, representing sparse data.'
        
        # 1. Add a row number (id)
        x = x.add_row_number()
    
        # 2. Stack will transform x to have a row for each unique (row, key) pair.
        x = x.stack(column_name, ['feature', 'value'])
    
        # Map feature words to integers 
        unique_words = sorted(x['feature'].unique())
        mapping = {word:i for i, word in enumerate(unique_words)}
        x['feature_id'] = x['feature'].apply(lambda x: mapping[x])
    
        # Create numpy arrays that contain the data for the sparse matrix.
        row_id = np.array(x['id'])
        col_id = np.array(x['feature_id'])
        data = np.array(x['value'])
        
        width = x['id'].max() + 1
        height = x['feature_id'].max() + 1
        
        # Create a sparse matrix.
        mat = csr_matrix((data, (row_id, col_id)), shape=(width, height))
        return mat, mapping
    
    In [6]:
    %%time
    # The conversion will take about a minute or two.
    tf_idf, map_index_to_word = sframe_to_scipy(wiki, 'tf_idf')
    
    Using default 16 lambda workers.
    To maximize the degree of parallelism, add the following code to the beginning of the program:
    "turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 32)"
    Note that increasing the degree of parallelism also increases the memory footprint.
    CPU times: user 3min 7s, sys: 13.5 s, total: 3min 21s
    Wall time: 3min 13s
    
    In [7]:
    tf_idf.shape
    
    Out[7]:
    (59071, 547979)
    In [8]:
    from sklearn.preprocessing import normalize
    tf_idf = normalize(tf_idf)
    
    In [9]:
    def get_initial_centroids(data, k, seed=None):
        '''Randomly choose k data points as initial centroids'''
        if seed is not None: # useful for obtaining consistent results
            np.random.seed(seed)
        n = data.shape[0] # number of data points
            
        # Pick K indices from range [0, N).
        rand_indices = np.random.randint(0, n, k)
        
        # Keep centroids as dense format, as many entries will be nonzero due to averaging.
        # As long as at least one document in a cluster contains a word,
        # it will carry a nonzero weight in the TF-IDF vector of the centroid.
        centroids = data[rand_indices,:].toarray()
        
        return centroids
    
    In [10]:
    from sklearn.metrics import pairwise_distances
    
    # Get the TF-IDF vectors for documents 100 through 102.
    queries = tf_idf[100:102,:]
    
    # Compute pairwise distances from every data point to each query vector.
    dist = pairwise_distances(tf_idf, queries, metric='euclidean')
    
    print(dist)
    
    [[1.41000789 1.36894636]
     [1.40935215 1.41023886]
     [1.39855967 1.40890299]
     ...
     [1.41108296 1.39123646]
     [1.41022804 1.31468652]
     [1.39899784 1.41072448]]
    
    In [11]:
    k = 3
    centroids = tf_idf[:k,:]
    distances = pairwise_distances(tf_idf, centroids, metric='euclidean')
    print(distances)
    dist = pairwise_distances(tf_idf[430,:], centroids[1], metric='euclidean')
    print(dist)
    
    [[0.         1.40775177 1.38784582]
     [1.40775177 0.         1.39867641]
     [1.38784582 1.39867641 0.        ]
     ...
     [1.37070999 1.40978937 1.40616385]
     [1.35214578 1.41306211 1.40869799]
     [1.40799024 1.41353429 1.40903605]]
    [[1.40713107]]
    
    In [12]:
    '''Test cell'''
    if np.allclose(dist, pairwise_distances(tf_idf[430,:], tf_idf[1,:])):
        print('Pass')
    else:
        print('Check your code again')
    
    Pass
    
    In [13]:
    closest_cluster = np.argmin(distances, 1)
    closest_cluster
    
    Out[13]:
    array([0, 1, 2, ..., 0, 0, 0])
    In [14]:
    '''Test cell'''
    reference = [list(row).index(min(row)) for row in distances]
    if np.allclose(closest_cluster, reference):
        print('Pass')
    else:
        print('Check your code again')
    
    Pass
    
    In [15]:
    def get_cluster_assignments(data, centroids):
        distances = pairwise_distances(data, centroids, metric='euclidean')
        return np.argmin(distances, 1)
    
    cluster_assignment = get_cluster_assignments(tf_idf, centroids)
    
    In [16]:
    if len(cluster_assignment)==59071 and \
       np.array_equal(np.bincount(cluster_assignment), np.array([23061, 10086, 25924])):
        print('Pass') # count number of data points for each cluster
    else:
        print('Check your code again.')
    
    Pass
    
    In [17]:
    def assign_clusters(data, centroids):
        
        # Compute distances between each data point and the set of centroids:
        # Fill in the blank (RHS only)
        #distances_from_centroids = ...   # YOUR CODE HERE
        
        # Compute cluster assignments for each data point:
        # Fill in the blank (RHS only)
        cluster_assignment = get_cluster_assignments(data, centroids)   # YOUR CODE HERE
        
        return cluster_assignment
    
    In [18]:
    if np.allclose(assign_clusters(tf_idf[0:100:10], tf_idf[0:8:2]), np.array([0, 1, 1, 0, 0, 2, 0, 2, 2, 1])):
        print('Pass')
    else:
        print('Check your code again.')
    
    Pass
    
    In [19]:
    data = np.array([[1., 2., 0.],
                     [0., 0., 0.],
                     [2., 2., 0.]])
    centroids = np.array([[0.5, 0.5, 0.],
                          [0., -0.5, 0.]])
    
    In [20]:
    cluster_assignment = assign_clusters(data, centroids)
    print(cluster_assignment)
    
    [0 1 0]
    
    In [21]:
    cluster_assignment==1
    
    Out[21]:
    array([False,  True, False])
    In [22]:
    cluster_assignment==0
    
    Out[22]:
    array([ True, False,  True])
    In [23]:
    data[cluster_assignment==1]
    
    Out[23]:
    array([[0., 0., 0.]])
    In [24]:
    data[cluster_assignment==0]
    
    Out[24]:
    array([[1., 2., 0.],
           [2., 2., 0.]])
    In [25]:
    data[cluster_assignment==0].mean(axis=0)
    
    Out[25]:
    array([1.5, 2. , 0. ])
    In [26]:
    def revise_centroids(data, k, cluster_assignment):
        new_centroids = []
        for i in range(k):
            # Select all data points that belong to cluster i. Fill in the blank (RHS only)
            member_data_points = data[cluster_assignment == i]   # YOUR CODE HERE
            # Compute the mean of the data points. Fill in the blank (RHS only)
            centroid = np.mean(member_data_points, axis=0)   # YOUR CODE HERE
            
            # Convert numpy.matrix type to numpy.ndarray type
            centroid = centroid.A1
            new_centroids.append(centroid)
        new_centroids = np.array(new_centroids)
        
        return new_centroids
    
    In [27]:
    result = revise_centroids(tf_idf[0:100:10], 3, np.array([0, 1, 1, 0, 0, 2, 0, 2, 2, 1]))
    if np.allclose(result[0], np.mean(tf_idf[[0,30,40,60]].toarray(), axis=0)) and \
       np.allclose(result[1], np.mean(tf_idf[[10,20,90]].toarray(), axis=0))   and \
       np.allclose(result[2], np.mean(tf_idf[[50,70,80]].toarray(), axis=0)):
        print('Pass')
    else:
        print('Check your code')
    
    Pass
    
    In [28]:
    def compute_heterogeneity(data, k, centroids, cluster_assignment):
        
        heterogeneity = 0.0
        for i in range(k):
            
            # Select all data points that belong to cluster i. Fill in the blank (RHS only)
            member_data_points = data[cluster_assignment==i, :]
            
            if member_data_points.shape[0] > 0: # check if i-th cluster is non-empty
                # Compute distances from centroid to data points (RHS only)
                distances = pairwise_distances(member_data_points, [centroids[i]], metric='euclidean')
                squared_distances = distances**2
                heterogeneity += np.sum(squared_distances)
            
        return heterogeneity
    
    In [29]:
    compute_heterogeneity(data, 2, centroids, cluster_assignment)
    
    Out[29]:
    7.25
    In [30]:
    # Fill in the blanks
    def kmeans(data, k, initial_centroids, maxiter, record_heterogeneity=None, verbose=False):
        '''This function runs k-means on given data and initial set of centroids.
           maxiter: maximum number of iterations to run.
           record_heterogeneity: (optional) a list, to store the history of heterogeneity as function of iterations
                                 if None, do not store the history.
           verbose: if True, print how many data points changed their cluster labels in each iteration'''
        centroids = initial_centroids[:]
        prev_cluster_assignment = None
        
        for itr in range(maxiter):        
            if verbose:
                print(itr)
            
            # 1. Make cluster assignments using nearest centroids
            # YOUR CODE HERE
            cluster_assignment = assign_clusters(data, centroids)
                
            # 2. Compute a new centroid for each of the k clusters, averaging all data points assigned to that cluster.
            # YOUR CODE HERE
            centroids = revise_centroids(data, k, cluster_assignment)
                
            # Check for convergence: if none of the assignments changed, stop
            if prev_cluster_assignment is not None and \
              (prev_cluster_assignment==cluster_assignment).all():
                break
            
            # Print number of new assignments 
            if prev_cluster_assignment is not None:
                num_changed = np.sum(prev_cluster_assignment!=cluster_assignment)
                if verbose:
                    print('    {0:5d} elements changed their cluster assignment.'.format(num_changed))   
            
            # Record heterogeneity convergence metric
            if record_heterogeneity is not None:
                # YOUR CODE HERE
                score = compute_heterogeneity(data, k, centroids, cluster_assignment)
                record_heterogeneity.append(score)
            
            prev_cluster_assignment = cluster_assignment[:]
            
        return centroids, cluster_assignment
    
    In [31]:
    def plot_heterogeneity(heterogeneity, k):
        plt.figure(figsize=(7,4))
        plt.plot(heterogeneity, linewidth=4)
        plt.xlabel('# Iterations')
        plt.ylabel('Heterogeneity')
        plt.title('Heterogeneity of clustering over time, K={0:d}'.format(k))
        plt.rcParams.update({'font.size': 16})
        plt.tight_layout()
    
    In [41]:
    k = 3
    heterogeneity = []
    initial_centroids = get_initial_centroids(tf_idf, k, seed=0)
    centroids, cluster_assignment = kmeans(tf_idf, k, initial_centroids, maxiter=400,
                                           record_heterogeneity=heterogeneity, verbose=True)
    plot_heterogeneity(heterogeneity, k)
    
    0
    1
        19157 elements changed their cluster assignment.
    2
         7739 elements changed their cluster assignment.
    3
         5119 elements changed their cluster assignment.
    4
         3370 elements changed their cluster assignment.
    5
         2811 elements changed their cluster assignment.
    6
         3233 elements changed their cluster assignment.
    7
         3815 elements changed their cluster assignment.
    8
         3172 elements changed their cluster assignment.
    9
         1149 elements changed their cluster assignment.
    10
          498 elements changed their cluster assignment.
    11
          265 elements changed their cluster assignment.
    12
          149 elements changed their cluster assignment.
    13
          100 elements changed their cluster assignment.
    14
           76 elements changed their cluster assignment.
    15
           67 elements changed their cluster assignment.
    16
           51 elements changed their cluster assignment.
    17
           47 elements changed their cluster assignment.
    18
           40 elements changed their cluster assignment.
    19
           34 elements changed their cluster assignment.
    20
           35 elements changed their cluster assignment.
    21
           39 elements changed their cluster assignment.
    22
           24 elements changed their cluster assignment.
    23
           16 elements changed their cluster assignment.
    24
           12 elements changed their cluster assignment.
    25
           14 elements changed their cluster assignment.
    26
           17 elements changed their cluster assignment.
    27
           15 elements changed their cluster assignment.
    28
           14 elements changed their cluster assignment.
    29
           16 elements changed their cluster assignment.
    30
           21 elements changed their cluster assignment.
    31
           22 elements changed their cluster assignment.
    32
           33 elements changed their cluster assignment.
    33
           35 elements changed their cluster assignment.
    34
           39 elements changed their cluster assignment.
    35
           36 elements changed their cluster assignment.
    36
           36 elements changed their cluster assignment.
    37
           25 elements changed their cluster assignment.
    38
           27 elements changed their cluster assignment.
    39
           25 elements changed their cluster assignment.
    40
           28 elements changed their cluster assignment.
    41
           35 elements changed their cluster assignment.
    42
           31 elements changed their cluster assignment.
    43
           25 elements changed their cluster assignment.
    44
           18 elements changed their cluster assignment.
    45
           15 elements changed their cluster assignment.
    46
           10 elements changed their cluster assignment.
    47
            8 elements changed their cluster assignment.
    48
            8 elements changed their cluster assignment.
    49
            8 elements changed their cluster assignment.
    50
            7 elements changed their cluster assignment.
    51
            8 elements changed their cluster assignment.
    52
            3 elements changed their cluster assignment.
    53
            3 elements changed their cluster assignment.
    54
            4 elements changed their cluster assignment.
    55
            2 elements changed their cluster assignment.
    56
            3 elements changed their cluster assignment.
    57
            3 elements changed their cluster assignment.
    58
            1 elements changed their cluster assignment.
    59
            1 elements changed their cluster assignment.
    60
    
    In [42]:
    np.bincount(cluster_assignment)
    
    Out[42]:
    array([19595, 10427, 29049])
    In [34]:
    k = 10
    heterogeneity = {}
    cluster_assignment_dict = {}
    import time
    start = time.time()
    for seed in [0, 20000, 40000, 60000, 80000, 100000, 120000]:
        initial_centroids = get_initial_centroids(tf_idf, k, seed)
        centroids, cluster_assignment = kmeans(tf_idf, k, initial_centroids, maxiter=400,
                                               record_heterogeneity=None, verbose=False)
        # To save time, compute heterogeneity only once in the end
        heterogeneity[seed] = compute_heterogeneity(tf_idf, k, centroids, cluster_assignment)
    
        # This is the line we added for the next quiz question
        cluster_assignment_dict[seed] = np.bincount(cluster_assignment)
        
    #    print('seed={0:06d}, heterogeneity={1:.5f}'.format(seed, heterogeneity[seed]))
        # And this is the modified print statement
        print('seed={0:06d}, heterogeneity={1:.5f}, cluster_distribution={2}'.format(seed, heterogeneity[seed], 
                                               cluster_assignment_dict[seed]))
        sys.stdout.flush()
    end = time.time()
    print(end-start)
    
    seed=000000, heterogeneity=57457.52442, cluster_distribution=[18047  3824  5671  6983  1492  1730  3882  3449  7139  6854]
    seed=020000, heterogeneity=57533.20100, cluster_distribution=[ 3142   768  3566  2277 15779  7278  6146  7964  6666  5485]
    seed=040000, heterogeneity=57512.69257, cluster_distribution=[ 5551  6623   186  2999  8487  3893  6807  2921  3472 18132]
    seed=060000, heterogeneity=57466.97925, cluster_distribution=[ 3014  3089  6681  3856  8080  7222  3424   424  5381 17900]
    seed=080000, heterogeneity=57494.92990, cluster_distribution=[17582  1785  7215  3314  6285   809  5930  6791  5536  3824]
    seed=100000, heterogeneity=57484.42210, cluster_distribution=[ 6618  1337  6191  2890 16969  4983  5242  3892  5562  5387]
    seed=120000, heterogeneity=57554.62410, cluster_distribution=[ 6118  5841  4964  8423  4302  3183 16481  1608  5524  2627]
    129.29057931900024
    
    In [35]:
    def smart_initialize(data, k, seed=None):
        '''Use k-means++ to initialize a good set of centroids'''
        if seed is not None: # useful for obtaining consistent results
            np.random.seed(seed)
        centroids = np.zeros((k, data.shape[1]))
        
        # Randomly choose the first centroid.
        # Since we have no prior knowledge, choose uniformly at random
        idx = np.random.randint(data.shape[0])
        centroids[0] = data[idx,:].toarray()
        # Compute distances from the first centroid chosen to all the other data points
        squared_distances = pairwise_distances(data, centroids[0:1], metric='euclidean').flatten()**2
        
        for i in range(1, k):
            # Choose the next centroid randomly, so that the probability for each data point to be chosen
            # is directly proportional to its squared distance from the nearest centroid.
            # Roughtly speaking, a new centroid should be as far as from ohter centroids as possible.
            idx = np.random.choice(data.shape[0], 1, p=squared_distances/sum(squared_distances))
            centroids[i] = data[idx,:].toarray()
            # Now compute distances from the centroids to all data points
            squared_distances = np.min(pairwise_distances(data, centroids[0:i+1], metric='euclidean')**2,axis=1)
        
        return centroids
    
    In [36]:
    %%time 
    
    k = 10
    heterogeneity_smart = {}
    seeds = [0, 20000, 40000, 60000, 80000, 100000, 120000]
    for seed in seeds:
        initial_centroids = smart_initialize(tf_idf, k, seed)
        centroids, cluster_assignment = kmeans(tf_idf, k, initial_centroids, maxiter=400,
                                               record_heterogeneity=None, verbose=False)
        # To save time, compute heterogeneity only once in the end
        heterogeneity_smart[seed] = compute_heterogeneity(tf_idf, k, centroids, cluster_assignment)
        print('seed={0:06d}, heterogeneity={1:.5f}'.format(seed, heterogeneity_smart[seed]))
        sys.stdout.flush()
    
    seed=000000, heterogeneity=57468.63808
    seed=020000, heterogeneity=57486.94263
    seed=040000, heterogeneity=57454.35926
    seed=060000, heterogeneity=57530.43659
    seed=080000, heterogeneity=57454.51852
    seed=100000, heterogeneity=57471.56674
    seed=120000, heterogeneity=57523.28839
    CPU times: user 2min 34s, sys: 4.72 s, total: 2min 39s
    Wall time: 2min 39s
    
    In [37]:
    plt.figure(figsize=(8,5))
    plt.boxplot([list(heterogeneity.values()), list(heterogeneity_smart.values())], vert=False)
    plt.yticks([1, 2], ['k-means', 'k-means++'])
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()
    
    In [38]:
    def kmeans_multiple_runs(data, k, maxiter, num_runs, seed_list=None, verbose=False):
        heterogeneity = {}
        
        min_heterogeneity_achieved = float('inf')
        best_seed = None
        final_centroids = None
        final_cluster_assignment = None
        
        for i in range(num_runs):
            
            # Use UTC time if no seeds are provided 
            if seed_list is not None: 
                seed = seed_list[i]
                np.random.seed(seed)
            else: 
                seed = int(time.time())
                np.random.seed(seed)
            
            # Use k-means++ initialization
            # YOUR CODE HERE
            initial_centroids = smart_initialize(data, k, seed)
            
            # Run k-means
            # YOUR CODE HERE
            centroids, cluster_assignment = kmeans(tf_idf, k, initial_centroids, maxiter=400,
                                               record_heterogeneity=None, verbose=False)
            
            # To save time, compute heterogeneity only once in the end
            # YOUR CODE HERE
            heterogeneity[seed] = compute_heterogeneity(tf_idf, k, centroids, cluster_assignment)
            
            if verbose:
                print('seed={0:06d}, heterogeneity={1:.5f}'.format(seed, heterogeneity[seed]))
                sys.stdout.flush()
            
            # if current measurement of heterogeneity is lower than previously seen,
            # update the minimum record of heterogeneity.
            if heterogeneity[seed] < min_heterogeneity_achieved:
                min_heterogeneity_achieved = heterogeneity[seed]
                best_seed = seed
                final_centroids = centroids
                final_cluster_assignment = cluster_assignment
        
        # Return the centroids and cluster assignments that minimize heterogeneity.
        return final_centroids, final_cluster_assignment
    
    In [39]:
    %%time
    import numpy as np 
    
    def plot_k_vs_heterogeneity(k_values, heterogeneity_values):
        plt.figure(figsize=(7,4))
        plt.plot(k_values, heterogeneity_values, linewidth=4)
        plt.xlabel('K')
        plt.ylabel('Heterogeneity')
        plt.title('K vs. Heterogeneity')
        plt.rcParams.update({'font.size': 16})
        plt.tight_layout()
    
    centroids = {}
    cluster_assignment = {}
    heterogeneity_values = []
    k_list = [2, 10, 25, 50, 100]
    #seed_list = [0]
    # Uncomment the following line to run the plot with all the seeds (it may take about an hour to finish).
    seed_list = [0, 20000, 40000, 60000, 80000, 100000, 120000]
    
    for k in k_list:
        heterogeneity = []
        centroids[k], cluster_assignment[k] = kmeans_multiple_runs(tf_idf, k, maxiter=400,
                                                                   num_runs=len(seed_list),                                                               seed_list=seed_list,
                                                                   verbose=True)
        score = compute_heterogeneity(tf_idf, k, centroids[k], cluster_assignment[k])
        heterogeneity_values.append(score)
    
    seed=000000, heterogeneity=58224.59913
    seed=020000, heterogeneity=58179.57453
    seed=040000, heterogeneity=58179.57453
    seed=060000, heterogeneity=58179.57453
    seed=080000, heterogeneity=58224.59952
    seed=100000, heterogeneity=58179.57453
    seed=120000, heterogeneity=58179.57453
    seed=000000, heterogeneity=57468.63808
    seed=020000, heterogeneity=57486.94263
    seed=040000, heterogeneity=57454.35926
    seed=060000, heterogeneity=57530.43659
    seed=080000, heterogeneity=57454.51852
    seed=100000, heterogeneity=57471.56674
    seed=120000, heterogeneity=57523.28839
    seed=000000, heterogeneity=56913.24052
    seed=020000, heterogeneity=56961.01793
    seed=040000, heterogeneity=56904.99744
    seed=060000, heterogeneity=56858.67830
    seed=080000, heterogeneity=56955.74619
    seed=100000, heterogeneity=56973.02116
    seed=120000, heterogeneity=56934.20148
    seed=000000, heterogeneity=56399.72145
    seed=020000, heterogeneity=56322.64583
    seed=040000, heterogeneity=56314.32239
    seed=060000, heterogeneity=56278.53939
    seed=080000, heterogeneity=56353.54891
    seed=100000, heterogeneity=56303.94021
    seed=120000, heterogeneity=56361.37319
    seed=000000, heterogeneity=55649.66538
    seed=020000, heterogeneity=55587.56988
    seed=040000, heterogeneity=55720.24668
    seed=060000, heterogeneity=55616.64653
    seed=080000, heterogeneity=55672.95812
    seed=100000, heterogeneity=55660.45384
    seed=120000, heterogeneity=55735.28103
    CPU times: user 33min 27s, sys: 3min 48s, total: 37min 16s
    Wall time: 37min 10s
    
    In [40]:
    plot_k_vs_heterogeneity(k_list, heterogeneity_values)
    
    In [69]:
    def visualize_document_clusters(wiki, tf_idf, centroids, cluster_assignment, k, map_word_to_index, display_content=True):
        '''wiki: original dataframe
           tf_idf: data matrix, sparse matrix format
           map_index_to_word: SFrame specifying the mapping betweeen words and column indices
           display_content: if True, display 8 nearest neighbors of each centroid'''
        map_index_to_word =  {v:k for k,v in map_word_to_index.items()}
        print('==========================================================')
        # Visualize each cluster c
        for c in range(k):
            # Cluster heading
            print('Cluster {0:d}    '.format(c)),
            # Print top 5 words with largest TF-IDF weights in the cluster
            idx = centroids[c].argsort()[::-1]
            for i in range(5): # Print each word along with the TF-IDF weight
                print('{0:s}:{1:.3f}'.format(map_index_to_word[idx[i]], centroids[c][idx[i]])),
            print('')
            
            if display_content:
                # Compute distances from the centroid to all data points in the cluster,
                # and compute nearest neighbors of the centroids within the cluster.
                distances = pairwise_distances(tf_idf, centroids[c].reshape(1, -1), metric='euclidean').flatten()
                distances[cluster_assignment!=c] = float('inf') # remove non-members from consideration
                
                nearest_neighbors = distances.argsort()
                
                # For 8 nearest neighbors, print the title as well as first 180 characters of text.
                # Wrap the text at 80-character mark.
                for i in range(8):
                    text = ' '.join(wiki[nearest_neighbors[i]]['text'].split(None, 25)[0:25])
                    print('\n* {0:50s} {1:.5f}\n  {2:s}\n  {3:s}'.format(wiki[nearest_neighbors[i]]['name'],
                        distances[nearest_neighbors[i]], text[:90], text[90:180] if len(text) > 90 else ''))
            print('==========================================================')
    
    In [70]:
    '''Notice the extra pairs of parentheses for centroids and cluster_assignment.
       The centroid and cluster_assignment are still inside the npz file,
       and we need to explicitly indicate when to load them into memory.'''
    visualize_document_clusters(wiki, tf_idf, centroids[2], cluster_assignment[2], 2, map_index_to_word)
    
    ==========================================================
    Cluster 0    
    she:0.021
    university:0.015
    her:0.013
    he:0.012
    served:0.010
    
    
    * Kayee Griffin                                      0.97358
      kayee frances griffin born 6 february 1950 is an australian politician and former australi
      an labor party member of the new south wales legislative council serving
    
    * %C3%81ine Hyland                                   0.97370
      ine hyland ne donlon is emeritus professor of education and former vicepresident of univer
      sity college cork ireland she was born in 1942 in athboy co
    
    * Christine Robertson                                0.97373
      christine mary robertson born 5 october 1948 is an australian politician and former austra
      lian labor party member of the new south wales legislative council serving
    
    * Anita Kunz                                         0.97471
      anita e kunz oc born 1956 is a canadianborn artist and illustratorkunz has lived in london
       new york and toronto contributing to magazines and working
    
    * Barry Sullivan (lawyer)                            0.97488
      barry sullivan is a chicago lawyer and as of july 1 2009 the cooney conway chair in advoca
      cy at loyola university chicago school of law
    
    * Margaret Catley-Carlson                            0.97534
      margaret catleycarlson oc born 6 october 1942 is a canadian civil servant she was chair an
      d is now a patron of the global water partnership
    
    * Vanessa Gilmore                                    0.97579
      vanessa diane gilmore born october 1956 is a judge on the united states district court for
       the southern district of texas she was appointed to
    
    * James A. Joseph                                    0.97624
      james a joseph born 1935 is an american former diplomatjoseph is professor of the practice
       of public policy studies at duke university and founder of
    ==========================================================
    Cluster 1    
    she:0.023
    music:0.017
    her:0.017
    league:0.016
    season:0.016
    
    
    * Patricia Scott                                     0.97143
      patricia scott pat born july 14 1929 is a former pitcher who played in the allamerican gir
      ls professional baseball league for parts of four seasons
    
    * Madonna (entertainer)                              0.97181
      madonna louise ciccone tkoni born august 16 1958 is an american singer songwriter actress 
      and businesswoman she achieved popularity by pushing the boundaries of lyrical
    
    * Janet Jackson                                      0.97257
      janet damita jo jackson born may 16 1966 is an american singer songwriter and actress know
      n for a series of sonically innovative socially conscious and
    
    * Natashia Williams                                  0.97343
      natashia williamsblach born august 2 1978 is an american actress and former wonderbra camp
      aign model who is perhaps best known for her role as shane
    
    * Todd Williams                                      0.97384
      todd michael williams born february 13 1971 in syracuse new york is a former major league 
      baseball relief pitcher he attended east syracuseminoa high school
    
    * Marilyn Jenkins                                    0.97430
      marilyn a jenkins jenks born september 18 1934 is a former catcher who played in the allam
      erican girls professional baseball league listed at 5 ft
    
    * Kayla Bashore Smedley                              0.97496
      kayla bashore born february 20 1983 in daegu south korea is an american field hockey defen
      der and midfielder now living in san diego california she
    
    * Cher                                               0.97510
      cher r born cherilyn sarkisian may 20 1946 is an american singer actress and television ho
      st described as embodying female autonomy in a maledominated industry
    ==========================================================
    
    In [71]:
    k = 10
    visualize_document_clusters(wiki, tf_idf, centroids[k], cluster_assignment[k], k, map_index_to_word)
    
    ==========================================================
    Cluster 0    
    he:0.012
    art:0.011
    his:0.009
    book:0.008
    that:0.008
    
    
    * Wilson McLean                                      0.97661
      wilson mclean born 1937 is a scottish illustrator and artist he has illustrated primarily 
      in the field of advertising but has also provided cover art
    
    * Tang Xiyang                                        0.97988
      tang xiyang born january 30 1930 in miluo hunan province is a chinese environmentalist he 
      was awarded the 2007 ramon magsaysay award for peace and
    
    * David Salle                                        0.98168
      david salle born 1952 is an american painter printmaker and stage designer who helped defi
      ne postmodern sensibility salle was born in norman oklahoma he earned
    
    * Alberto Blanco (poet)                              0.98172
      alberto blanco is considered one of mexicos most important poets born in mexico city on fe
      bruary 18 1951 he spent his childhood and adolescence in
    
    * John Donald (jewellery designer)                   0.98290
      john donald is a british jeweller designer whose work is strongly identified in the 1960s 
      and 1970s in london princess margaret and the queen mother
    
    * David Elliott (curator)                            0.98298
      david stuart elliott born 29 april 1949 is a britishborn art gallery and museum curator an
      d writer about modern and contemporary arthe was educated at
    
    * Chris Hunt                                         0.98307
      chris hunt is a british journalist magazine editor and author he has worked in journalism 
      for over twenty years most often writing about football or
    
    * Kcho                                               0.98342
      kchosometimes spelled kcho born alexis leiva machado on the isla de pinos 1970 is a contem
      porary cuban artist kcho has had art showings around the
    ==========================================================
    Cluster 1    
    film:0.088
    theatre:0.037
    films:0.032
    television:0.028
    actor:0.027
    
    
    * Shona Auerbach                                     0.93531
      shona auerbach is a british film director and cinematographerauerbach began her career as 
      a stills photographer she studied film at manchester university and cinematography at
    
    * Singeetam Srinivasa Rao                            0.93748
      singeetam srinivasa rao born 21 september 1931 is an indian film director producer screenw
      riter composer singer lyricist and actor known for his works in telugu
    
    * Justin Edgar                                       0.93801
      justin edgar is a british film directorborn in handsworth birmingham on 18 august 1971 edg
      ar graduated from portsmouth university in 1996 with a first class
    
    * Laura Neri                                         0.94151
      laura neri greek is a director of greek and italian origins born in brussels belgium she g
      raduated from the usc school of cinematic arts in
    
    * Bill Bennett (director)                            0.94260
      bill bennett born 1953 is an australian film director producer and screenwriterhe dropped 
      out of medicine at queensland university in 1972 and joined the australian
    
    * Robert Braiden                                     0.94344
      robert braiden is an australian film director and writer born in sydney he grew up in moor
      ebank liverpool new south wales and now currently lives
    
    * Nitzan Gilady                                      0.94369
      nitzan gilady also known as nitzan giladi hebrew is an israeli film director who has writt
      en produced and directed the documentary films in satmar custody
    
    * Robb Moss                                          0.94484
      robb moss is an independent documentary filmmaker and professor at harvard university nota
      ble work includes such films as the same river twice secrecy film and
    ==========================================================
    Cluster 2    
    league:0.061
    baseball:0.048
    season:0.046
    coach:0.042
    games:0.034
    
    
    * Todd Williams                                      0.92759
      todd michael williams born february 13 1971 in syracuse new york is a former major league 
      baseball relief pitcher he attended east syracuseminoa high school
    
    * Justin Knoedler                                    0.93295
      justin joseph knoedler born july 17 1980 in springfield illinois is a former major league 
      baseball catcherknoedler was originally drafted by the st louis cardinals
    
    * Kevin Nicholson (baseball)                         0.93579
      kevin ronald nicholson born march 29 1976 is a canadian baseball shortstop he played part 
      of the 2000 season for the san diego padres of
    
    * Dave Ford                                          0.93642
      david alan ford born december 29 1956 is a former major league baseball pitcher for the ba
      ltimore orioles born in cleveland ohio ford attended lincolnwest
    
    * Steve Springer                                     0.93649
      steven michael springer born february 11 1961 is an american former professional baseball 
      player who appeared in major league baseball as a third baseman and
    
    * Chris Young (pitcher)                              0.93772
      christopher ryan chris young born may 25 1979 is an american professional baseball rightha
      nded pitcher who is a free agent he made his major league
    
    * Eric Fox                                           0.93875
      eric hollis fox born august 15 1963 in lemoore california is an american professional base
      ball coach the 5 ft 10 in 178 m 180 lb
    
    * Ted Silva                                          0.93948
      theodore a silva born august 4 1974 in inglewood california has held numerous roles in ama
      teur and professional baseball he has played in the minor
    ==========================================================
    Cluster 3    
    party:0.046
    election:0.042
    minister:0.039
    elected:0.028
    member:0.020
    
    
    * Stephen Harper                                     0.95128
      stephen joseph harper pc mp born april 30 1959 is a canadian politician who is the 22nd an
      d current prime minister of canada and the
    
    * Lucienne Robillard                                 0.95307
      lucienne robillard pc born june 16 1945 is a canadian politician and a member of the liber
      al party of canada she sat in the house
    
    * Marcelle Mersereau                                 0.95379
      marcelle mersereau born february 14 1942 in pointeverte new brunswick is a canadian politi
      cian a civil servant for most of her career she also served
    
    * Maureen Lyster                                     0.95450
      maureen anne lyster born 10 september 1943 is an australian politician she was an australi
      an labor party member of the victorian legislative assembly from 1985
    
    * Bruce Flegg                                        0.95554
      dr bruce stephen flegg born 10 march 1954 in sydney is an australian former politician he 
      was a member of the queensland legislative assembly from
    
    * Doug Lewis                                         0.95583
      douglas grinslade doug lewis pc qc born april 17 1938 is a former canadian politician a ch
      artered accountant and lawyer by training lewis entered the
    
    * Paul Martin                                        0.95583
      paul edgar philippe martin pc cc born august 28 1938 also known as paul martin jr is a can
      adian politician who was the 21st prime
    
    * Gordon Gibson                                      0.95612
      gordon gibson obc born 1937 is a political columnist author and former politician in briti
      sh columbia bc canada he is the son of the late
    ==========================================================
    Cluster 4    
    music:0.095
    orchestra:0.087
    symphony:0.057
    opera:0.050
    conductor:0.041
    
    
    * Heiichiro Ohyama                                   0.89162
      heiichiro ohyama yama heiichir born 1947 in kyoto japan is a japanese conductor and violis
      the has a longestablished reputation as a remarkable conductor and one
    
    * Brenton Broadstock                                 0.90284
      brenton broadstock ao born 1952 is an australian composerbroadstock was born in melbourne 
      he studied history politics and music at monash university and later composition
    
    * Toshiyuki Shimada                                  0.90301
      toshiyuki shimada is a japanese american orchestral conductor he is music director of both
       the eastern connecticut symphony orchestra in new london ct and the
    
    * David Porcelijn                                    0.90357
      david porcelijn born 7 january 1947 in achtkarspelen is a dutch composer and conductordavi
      d porcelijn studied flute composition and conducting at the royal conservatoire of
    
    * Hugh Wolff                                         0.90731
      hugh wolff born 21 october 1953 in paris is an american conductorhe was born in paris whil
      e his father was serving in the u s
    
    * Daniel Meyer (conductor)                           0.90849
      daniel meyer was born in cleveland ohio and has been conductor and musical director of sev
      eral prominent american orchestrashe is a graduate of denison university
    
    * Paul Hostetter                                     0.90967
      paul hostetter is the ethel foley distinguished chair in orchestral activities for the sch
      wob school of music at columbus state university the conductor and artistic
    
    * Peter Ruzicka                                      0.91097
      peter ruzicka born july 3 1948 is a german composer and conductor of classical musicpeter 
      ruzicka was born in dsseldorf on july 3 1948 he
    ==========================================================
    Cluster 5    
    she:0.140
    her:0.088
    miss:0.012
    actress:0.011
    womens:0.011
    
    
    * Lauren Royal                                       0.93427
      lauren royal born march 3 circa 1965 is a book writer from california royal has written bo
      th historic and novelistic booksa selfproclaimed angels baseball fan
    
    * Janine Shepherd                                    0.93681
      janine lee shepherd am born 1962 is an australian pilot and former crosscountry skier shep
      herds career as an athlete ended when she suffered major injuries
    
    * Barbara Hershey                                    0.93708
      barbara hershey born barbara lynn herzstein february 5 1948 once known as barbara seagull 
      is an american actress in a career spanning nearly 50 years
    
    * Janet Jackson                                      0.93735
      janet damita jo jackson born may 16 1966 is an american singer songwriter and actress know
      n for a series of sonically innovative socially conscious and
    
    * Ellina Graypel                                     0.93837
      ellina graypel born july 19 1972 is an awardwinning russian singersongwriter she was born 
      near the volga river in the heart of russia she spent
    
    * Alexandra Potter                                   0.93867
      alexandra potter born 1970 is a british author of romantic comediesborn in bradford yorksh
      ire england and educated at liverpool university gaining an honors degree in
    
    * Dorothy E. Smith                                   0.93904
      dorothy edith smithborn july 6 1926 is a canadian sociologist with research interests besi
      des in sociology in many disciplines including womens studies psychology and educational
    
    * Jane Fonda                                         0.93914
      jane fonda born lady jayne seymour fonda december 21 1937 is an american actress writer po
      litical activist former fashion model and fitness guru she is
    ==========================================================
    Cluster 6    
    album:0.053
    band:0.044
    music:0.042
    released:0.028
    jazz:0.023
    
    
    * Will.i.am                                          0.95336
      william adams born march 15 1975 known by his stage name william pronounced will i am is a
      n american rapper songwriter entrepreneur actor dj record
    
    * Tony Mills (musician)                              0.95359
      tony mills born 7 july 1962 in solihull england is an english rock singer best known for h
      is work with shy and tnthailing from birmingham
    
    * Keith Urban                                        0.95379
      keith lionel urban born 26 october 1967 is a new zealand born australian country music sin
      ger songwriter guitarist entrepreneur and music competition judge in 1991
    
    * Prince (musician)                                  0.95420
      prince rogers nelson born june 7 1958 known by his mononym prince is an american singerson
      gwriter multiinstrumentalist and actor he has produced ten platinum albums
    
    * Steve Overland                                     0.95503
      steve overland is a british singermusician who was the lead vocalist and songwriter for th
      e bands wildlife fm the ladder shadowman and his own group
    
    * Jesse Johnson (musician)                           0.95606
      jesse woods johnson born june 1 1960 in rock island illinois is a musician best known as t
      he guitarist in the original lineup of the
    
    * Mark Cross (musician)                              0.95609
      mark cross born 2 august 1965 london is a hard rock and heavy metal drummer he was born to
       an english father and german mother
    
    * Stewart Levine                                     0.95888
      stewart levine is an american record producer he has worked with such artists as the crusa
      ders minnie riperton lionel richie simply red hugh masekela dr
    ==========================================================
    Cluster 7    
    law:0.133
    court:0.081
    judge:0.060
    district:0.042
    justice:0.040
    
    
    * Barry Sullivan (lawyer)                            0.89228
      barry sullivan is a chicago lawyer and as of july 1 2009 the cooney conway chair in advoca
      cy at loyola university chicago school of law
    
    * William G. Young                                   0.89433
      william glover young born 1940 is a united states federal judge for the district of massac
      husetts young was born in huntington new york he attended
    
    * Bernard Bell (attorney)                            0.89617
      bernard bell is the associate dean for academic affairs and faculty professor of law and h
      erbert hannoch scholar at rutgers school of lawnewark bell received
    
    * George B. Daniels                                  0.89796
      george benjamin daniels born 1953 is a united states federal judge for the united states d
      istrict court for the southern district of new yorkdaniels was
    
    * Robinson O. Everett                                0.90432
      robinson o everett march 18 1928 june 12 2009 was an american lawyer judge and a professor
       of law at duke universityeverett was born in
    
    * James G. Carr                                      0.90595
      james g carr born july 7 1940 is a federal district judge for the united states district c
      ourt for the northern district of ohiocarr was
    
    * John C. Eastman                                    0.90764
      john c eastman born april 21 1960 is a conservative american law professor and constitutio
      nal law scholar he is the henry salvatori professor of law
    
    * Jean Constance Hamilton                            0.90830
      jean constance hamilton born 1945 is a senior united states district judge of the united s
      tates district court for the eastern district of missouriborn in
    ==========================================================
    Cluster 8    
    football:0.050
    league:0.044
    club:0.043
    season:0.042
    played:0.037
    
    
    * Chris Day                                          0.93686
      christopher nicholas chris day born 28 july 1975 is an english professional footballer who
       plays as a goalkeeper for stevenageday started his career at tottenham
    
    * Jason Roberts (footballer)                         0.93775
      jason andre davis roberts mbe born 25 january 1978 is a former professional footballer and
       now a football punditborn in park royal london roberts was
    
    * Tony Smith (footballer, born 1957)                 0.93839
      anthony tony smith born 20 february 1957 is a former footballer who played as a central de
      fender in the football league in the 1970s and
    
    * Neil Grayson                                       0.94178
      neil grayson born 1 november 1964 in york is an english footballer who last played as a st
      riker for sutton towngraysons first club was local
    
    * Richard Ambrose                                    0.94220
      richard ambrose born 10 june 1972 is a former australian rules footballer who played with 
      the sydney swans in the australian football league afl he
    
    * Paul Robinson (footballer, born 1979)              0.94245
      paul william robinson born 15 october 1979 is an english professional footballer who plays
       for blackburn rovers as a goalkeeper he is a former england
    
    * Alex Lawless                                       0.94269
      alexander graham alex lawless born 26 march 1985 is a welsh professional footballer who pl
      ays for luton town as a midfielderlawless began his career with
    
    * Sol Campbell                                       0.94275
      sulzeer jeremiah sol campbell born 18 september 1974 is a former england international foo
      tballer a central defender he had a 19year career playing in the
    ==========================================================
    Cluster 9    
    research:0.038
    university:0.035
    professor:0.030
    science:0.023
    institute:0.019
    
    
    * Lawrence W. Green                                  0.95862
      lawrence w green is best known by health education researchers as the originator of the pr
      ecede model and codeveloper of the precedeproceed model which has
    
    * Timothy Luke                                       0.96028
      timothy w luke is university distinguished professor of political science in the college o
      f liberal arts and human sciences as well as program chair of
    
    * Ren%C3%A9e Fox                                     0.96119
      rene c fox a summa cum laude graduate of smith college in 1949 earned her phd in sociology
       in 1954 from radcliffe college harvard university
    
    * Francis Gavin                                      0.96213
      francis j gavin is first frank stanton chair in nuclear security policy studies and profes
      sor of political science at mit before joining mit he was
    
    * Catherine Hakim                                    0.96358
      catherine hakim born 30 may 1948 is a british sociologist who specialises in womens employ
      ment and womens issues she is currently a professorial research fellow
    
    * Daniel Berg (educator)                             0.96361
      daniel berg is a scientist educator and was the fifteenth president of rensselaer polytech
      nic institutehe was born on june 1 1929 in new york city
    
    * Georg von Krogh                                    0.96375
      georg von krogh was born in oslo norway he is a professor at eth zurich and holds the chai
      r of strategic management and innovation he
    
    * Martin Apple                                       0.96381
      martin a apple is president of the council of scientific society presidents cssp an organi
      zation of presidents of some sixty scientific federations and societies whose
    ==========================================================
    
    In [72]:
    np.bincount(cluster_assignment[10])
    
    Out[72]:
    array([19618,  3857,  4173,  5219,  1743,  6900,  5530,  1348,  4384,
            6299])
    In [75]:
    visualize_document_clusters(wiki, tf_idf, centroids[25], cluster_assignment[25], 25,
                                map_index_to_word, display_content=False) # turn off text for brevity
    
    ==========================================================
    Cluster 0    
    poetry:0.053
    novel:0.043
    book:0.042
    published:0.039
    fiction:0.034
    
    ==========================================================
    Cluster 1    
    film:0.100
    theatre:0.039
    films:0.036
    directed:0.029
    actor:0.028
    
    ==========================================================
    Cluster 2    
    law:0.143
    court:0.087
    judge:0.066
    district:0.045
    justice:0.042
    
    ==========================================================
    Cluster 3    
    republican:0.061
    senate:0.050
    district:0.044
    state:0.039
    democratic:0.037
    
    ==========================================================
    Cluster 4    
    music:0.114
    piano:0.047
    orchestra:0.039
    composition:0.038
    composer:0.034
    
    ==========================================================
    Cluster 5    
    album:0.116
    released:0.058
    her:0.056
    single:0.046
    music:0.040
    
    ==========================================================
    Cluster 6    
    music:0.055
    jazz:0.038
    album:0.028
    song:0.020
    records:0.019
    
    ==========================================================
    Cluster 7    
    board:0.028
    business:0.027
    economics:0.026
    chairman:0.025
    president:0.025
    
    ==========================================================
    Cluster 8    
    he:0.011
    his:0.009
    that:0.009
    world:0.007
    book:0.007
    
    ==========================================================
    Cluster 9    
    research:0.050
    university:0.039
    professor:0.038
    science:0.030
    institute:0.021
    
    ==========================================================
    Cluster 10    
    foreign:0.075
    ambassador:0.063
    affairs:0.057
    security:0.044
    nations:0.042
    
    ==========================================================
    Cluster 11    
    baseball:0.110
    league:0.103
    major:0.052
    games:0.047
    season:0.045
    
    ==========================================================
    Cluster 12    
    art:0.146
    museum:0.078
    gallery:0.057
    artist:0.033
    arts:0.032
    
    ==========================================================
    Cluster 13    
    air:0.028
    military:0.027
    police:0.024
    force:0.023
    commander:0.022
    
    ==========================================================
    Cluster 14    
    party:0.064
    minister:0.063
    election:0.054
    parliament:0.031
    elected:0.031
    
    ==========================================================
    Cluster 15    
    radio:0.072
    show:0.052
    news:0.051
    bbc:0.033
    television:0.030
    
    ==========================================================
    Cluster 16    
    church:0.120
    bishop:0.091
    diocese:0.044
    lds:0.044
    archbishop:0.043
    
    ==========================================================
    Cluster 17    
    opera:0.212
    ballet:0.088
    she:0.061
    la:0.035
    her:0.033
    
    ==========================================================
    Cluster 18    
    orchestra:0.203
    symphony:0.146
    conductor:0.107
    philharmonic:0.077
    music:0.076
    
    ==========================================================
    Cluster 19    
    she:0.146
    her:0.092
    miss:0.017
    actress:0.015
    women:0.011
    
    ==========================================================
    Cluster 20    
    racing:0.127
    formula:0.078
    race:0.066
    car:0.060
    driver:0.054
    
    ==========================================================
    Cluster 21    
    championships:0.057
    tour:0.055
    pga:0.041
    olympics:0.035
    metres:0.035
    
    ==========================================================
    Cluster 22    
    league:0.052
    rugby:0.049
    club:0.046
    cup:0.045
    season:0.041
    
    ==========================================================
    Cluster 23    
    band:0.104
    album:0.049
    rock:0.031
    guitar:0.031
    bands:0.030
    
    ==========================================================
    Cluster 24    
    football:0.057
    coach:0.053
    season:0.047
    basketball:0.042
    played:0.039
    
    ==========================================================
    
    In [76]:
    k=100
    visualize_document_clusters(wiki, tf_idf, centroids[k], cluster_assignment[k], k,
                                map_index_to_word, display_content=False)
    # turn off text for brevity -- turn it on if you are curious ;)
    
    ==========================================================
    Cluster 0    
    psychology:0.195
    psychological:0.066
    research:0.057
    psychologist:0.045
    cognitive:0.041
    
    ==========================================================
    Cluster 1    
    film:0.213
    festival:0.060
    films:0.054
    directed:0.039
    feature:0.037
    
    ==========================================================
    Cluster 2    
    law:0.146
    court:0.097
    judge:0.074
    district:0.051
    justice:0.045
    
    ==========================================================
    Cluster 3    
    mayor:0.137
    city:0.049
    council:0.040
    elected:0.032
    election:0.030
    
    ==========================================================
    Cluster 4    
    music:0.147
    composition:0.048
    composer:0.047
    orchestra:0.026
    composers:0.025
    
    ==========================================================
    Cluster 5    
    album:0.108
    her:0.078
    billboard:0.075
    chart:0.072
    singles:0.067
    
    ==========================================================
    Cluster 6    
    music:0.052
    songs:0.025
    records:0.023
    song:0.022
    album:0.022
    
    ==========================================================
    Cluster 7    
    chairman:0.057
    board:0.048
    president:0.035
    executive:0.033
    ceo:0.025
    
    ==========================================================
    Cluster 8    
    german:0.120
    germany:0.042
    der:0.030
    berlin:0.025
    die:0.017
    
    ==========================================================
    Cluster 9    
    india:0.092
    indian:0.084
    sabha:0.038
    lok:0.033
    singh:0.028
    
    ==========================================================
    Cluster 10    
    czech:0.207
    prague:0.124
    republic:0.046
    czechoslovakia:0.032
    vclav:0.021
    
    ==========================================================
    Cluster 11    
    soccer:0.294
    league:0.071
    indoor:0.063
    team:0.055
    season:0.052
    
    ==========================================================
    Cluster 12    
    novel:0.096
    fiction:0.079
    published:0.044
    stories:0.043
    short:0.039
    
    ==========================================================
    Cluster 13    
    prison:0.036
    police:0.034
    sentenced:0.026
    court:0.025
    convicted:0.024
    
    ==========================================================
    Cluster 14    
    labor:0.104
    australian:0.097
    liberal:0.073
    election:0.067
    minister:0.064
    
    ==========================================================
    Cluster 15    
    radio:0.112
    show:0.069
    host:0.041
    station:0.035
    sports:0.026
    
    ==========================================================
    Cluster 16    
    bishop:0.147
    church:0.084
    diocese:0.075
    archbishop:0.073
    ordained:0.055
    
    ==========================================================
    Cluster 17    
    de:0.118
    la:0.050
    french:0.026
    el:0.021
    paris:0.018
    
    ==========================================================
    Cluster 18    
    clarinet:0.087
    bass:0.086
    saxophone:0.079
    flute:0.077
    music:0.062
    
    ==========================================================
    Cluster 19    
    book:0.045
    books:0.032
    published:0.027
    editor:0.025
    magazine:0.021
    
    ==========================================================
    Cluster 20    
    racing:0.135
    nascar:0.104
    car:0.094
    race:0.077
    series:0.075
    
    ==========================================================
    Cluster 21    
    tour:0.259
    pga:0.216
    golf:0.139
    open:0.073
    golfer:0.062
    
    ==========================================================
    Cluster 22    
    league:0.089
    town:0.063
    season:0.061
    club:0.059
    football:0.054
    
    ==========================================================
    Cluster 23    
    album:0.119
    released:0.058
    music:0.034
    records:0.027
    single:0.026
    
    ==========================================================
    Cluster 24    
    football:0.054
    cup:0.048
    club:0.045
    team:0.039
    league:0.036
    
    ==========================================================
    Cluster 25    
    league:0.096
    era:0.093
    baseball:0.089
    innings:0.086
    pitcher:0.085
    
    ==========================================================
    Cluster 26    
    air:0.373
    force:0.241
    command:0.105
    commander:0.094
    base:0.080
    
    ==========================================================
    Cluster 27    
    physics:0.173
    quantum:0.059
    theoretical:0.045
    research:0.043
    theory:0.039
    
    ==========================================================
    Cluster 28    
    sierra:0.276
    leone:0.220
    koroma:0.061
    freetown:0.056
    leonean:0.046
    
    ==========================================================
    Cluster 29    
    russian:0.172
    soviet:0.067
    russia:0.056
    moscow:0.056
    vladimir:0.022
    
    ==========================================================
    Cluster 30    
    harris:0.398
    university:0.014
    alabama:0.013
    state:0.013
    he:0.012
    
    ==========================================================
    Cluster 31    
    theatre:0.194
    directed:0.033
    production:0.031
    she:0.029
    play:0.029
    
    ==========================================================
    Cluster 32    
    linguistics:0.164
    language:0.147
    linguistic:0.064
    languages:0.044
    research:0.036
    
    ==========================================================
    Cluster 33    
    economics:0.150
    economic:0.098
    economist:0.053
    policy:0.047
    research:0.045
    
    ==========================================================
    Cluster 34    
    news:0.131
    anchor:0.068
    reporter:0.057
    she:0.044
    correspondent:0.034
    
    ==========================================================
    Cluster 35    
    rights:0.172
    human:0.129
    law:0.056
    she:0.036
    civil:0.029
    
    ==========================================================
    Cluster 36    
    foreign:0.062
    ambassador:0.056
    affairs:0.051
    security:0.044
    secretary:0.043
    
    ==========================================================
    Cluster 37    
    mathematics:0.151
    mathematical:0.113
    theory:0.055
    professor:0.047
    mathematician:0.047
    
    ==========================================================
    Cluster 38    
    mexico:0.132
    mexican:0.118
    de:0.040
    pri:0.031
    cartel:0.026
    
    ==========================================================
    Cluster 39    
    film:0.141
    documentary:0.087
    films:0.056
    festival:0.041
    cinema:0.033
    
    ==========================================================
    Cluster 40    
    hong:0.283
    kong:0.267
    chinese:0.067
    china:0.038
    wong:0.034
    
    ==========================================================
    Cluster 41    
    actor:0.049
    role:0.047
    film:0.043
    series:0.042
    appeared:0.037
    
    ==========================================================
    Cluster 42    
    football:0.120
    afl:0.113
    australian:0.083
    season:0.058
    club:0.058
    
    ==========================================================
    Cluster 43    
    band:0.116
    album:0.046
    bands:0.035
    guitar:0.033
    rock:0.031
    
    ==========================================================
    Cluster 44    
    puerto:0.308
    rico:0.217
    rican:0.068
    juan:0.042
    ricos:0.032
    
    ==========================================================
    Cluster 45    
    giants:0.257
    baseball:0.104
    league:0.085
    francisco:0.065
    san:0.064
    
    ==========================================================
    Cluster 46    
    racing:0.104
    jockey:0.078
    race:0.069
    stakes:0.064
    horse:0.053
    
    ==========================================================
    Cluster 47    
    archaeology:0.281
    archaeological:0.114
    ancient:0.072
    archaeologist:0.059
    excavations:0.055
    
    ==========================================================
    Cluster 48    
    art:0.082
    artist:0.032
    gallery:0.031
    painting:0.028
    paintings:0.028
    
    ==========================================================
    Cluster 49    
    she:0.136
    her:0.120
    actress:0.024
    film:0.018
    television:0.013
    
    ==========================================================
    Cluster 50    
    comics:0.191
    comic:0.122
    strip:0.039
    graphic:0.036
    book:0.034
    
    ==========================================================
    Cluster 51    
    comedy:0.165
    show:0.068
    comedian:0.060
    standup:0.050
    series:0.030
    
    ==========================================================
    Cluster 52    
    formula:0.167
    racing:0.129
    car:0.079
    driver:0.078
    championship:0.074
    
    ==========================================================
    Cluster 53    
    church:0.186
    lds:0.185
    churchs:0.094
    latterday:0.070
    byu:0.068
    
    ==========================================================
    Cluster 54    
    design:0.169
    architecture:0.121
    architectural:0.058
    architects:0.038
    architect:0.038
    
    ==========================================================
    Cluster 55    
    university:0.048
    philosophy:0.042
    professor:0.041
    studies:0.038
    history:0.037
    
    ==========================================================
    Cluster 56    
    food:0.260
    cooking:0.049
    she:0.038
    cookbook:0.031
    culinary:0.028
    
    ==========================================================
    Cluster 57    
    oklahoma:0.212
    oregon:0.169
    portland:0.040
    district:0.032
    law:0.031
    
    ==========================================================
    Cluster 58    
    piano:0.093
    music:0.071
    orchestra:0.065
    chamber:0.045
    symphony:0.040
    
    ==========================================================
    Cluster 59    
    iraqi:0.160
    iraq:0.150
    baghdad:0.060
    saddam:0.044
    hussein:0.035
    
    ==========================================================
    Cluster 60    
    business:0.034
    company:0.023
    technology:0.023
    management:0.023
    global:0.019
    
    ==========================================================
    Cluster 61    
    league:0.120
    baseball:0.108
    major:0.058
    minor:0.057
    season:0.042
    
    ==========================================================
    Cluster 62    
    freestyle:0.154
    swimming:0.124
    m:0.117
    swimmer:0.090
    heat:0.074
    
    ==========================================================
    Cluster 63    
    song:0.162
    eurovision:0.112
    contest:0.073
    she:0.057
    her:0.038
    
    ==========================================================
    Cluster 64    
    bbc:0.235
    radio:0.119
    news:0.053
    presenter:0.053
    she:0.051
    
    ==========================================================
    Cluster 65    
    army:0.078
    command:0.078
    commander:0.078
    military:0.074
    staff:0.059
    
    ==========================================================
    Cluster 66    
    jazz:0.214
    music:0.047
    band:0.035
    pianist:0.026
    trio:0.024
    
    ==========================================================
    Cluster 67    
    chef:0.195
    restaurant:0.130
    wine:0.102
    cooking:0.060
    food:0.059
    
    ==========================================================
    Cluster 68    
    turkish:0.176
    turkey:0.104
    istanbul:0.072
    ankara:0.029
    she:0.026
    
    ==========================================================
    Cluster 69    
    thai:0.158
    cpc:0.063
    china:0.060
    party:0.057
    thailand:0.055
    
    ==========================================================
    Cluster 70    
    miss:0.358
    pageant:0.206
    usa:0.122
    she:0.111
    her:0.063
    
    ==========================================================
    Cluster 71    
    republican:0.070
    senate:0.054
    district:0.048
    state:0.041
    house:0.039
    
    ==========================================================
    Cluster 72    
    wrestling:0.069
    championship:0.046
    tennis:0.045
    doubles:0.039
    champion:0.036
    
    ==========================================================
    Cluster 73    
    poetry:0.214
    poems:0.081
    poet:0.069
    poets:0.043
    literary:0.041
    
    ==========================================================
    Cluster 74    
    baseball:0.235
    league:0.077
    she:0.055
    aagpbl:0.045
    allamerican:0.039
    
    ==========================================================
    Cluster 75    
    poker:0.477
    wsop:0.121
    event:0.091
    limit:0.078
    winnings:0.072
    
    ==========================================================
    Cluster 76    
    rugby:0.195
    cup:0.049
    against:0.046
    played:0.044
    wales:0.039
    
    ==========================================================
    Cluster 77    
    he:0.010
    that:0.009
    his:0.009
    it:0.007
    has:0.006
    
    ==========================================================
    Cluster 78    
    blues:0.121
    drummer:0.066
    band:0.054
    rock:0.041
    drum:0.037
    
    ==========================================================
    Cluster 79    
    marathon:0.064
    olympics:0.059
    championships:0.057
    olympic:0.055
    she:0.041
    
    ==========================================================
    Cluster 80    
    hockey:0.218
    nhl:0.136
    ice:0.066
    season:0.053
    league:0.048
    
    ==========================================================
    Cluster 81    
    party:0.066
    minister:0.061
    election:0.044
    parliament:0.036
    elected:0.031
    
    ==========================================================
    Cluster 82    
    computer:0.096
    engineering:0.074
    research:0.047
    science:0.046
    systems:0.038
    
    ==========================================================
    Cluster 83    
    election:0.083
    manitoba:0.071
    minister:0.069
    liberal:0.068
    canadian:0.055
    
    ==========================================================
    Cluster 84    
    orchestra:0.221
    symphony:0.156
    conductor:0.132
    music:0.081
    philharmonic:0.080
    
    ==========================================================
    Cluster 85    
    sri:0.288
    lanka:0.187
    lankan:0.098
    colombo:0.048
    ceylon:0.029
    
    ==========================================================
    Cluster 86    
    basketball:0.157
    nba:0.086
    coach:0.077
    points:0.049
    season:0.042
    
    ==========================================================
    Cluster 87    
    cricket:0.194
    firstclass:0.114
    cricketer:0.073
    batsman:0.069
    wickets:0.061
    
    ==========================================================
    Cluster 88    
    runs:0.116
    league:0.100
    baseball:0.089
    batted:0.067
    home:0.064
    
    ==========================================================
    Cluster 89    
    she:0.178
    her:0.053
    women:0.021
    member:0.017
    university:0.017
    
    ==========================================================
    Cluster 90    
    research:0.061
    medical:0.047
    medicine:0.046
    professor:0.033
    chemistry:0.031
    
    ==========================================================
    Cluster 91    
    columbia:0.096
    vancouver:0.091
    bc:0.075
    british:0.073
    canadian:0.071
    
    ==========================================================
    Cluster 92    
    metres:0.179
    championships:0.145
    athletics:0.096
    she:0.079
    m:0.070
    
    ==========================================================
    Cluster 93    
    health:0.228
    medical:0.072
    medicine:0.071
    care:0.057
    research:0.034
    
    ==========================================================
    Cluster 94    
    football:0.118
    nfl:0.089
    yards:0.065
    coach:0.058
    bowl:0.046
    
    ==========================================================
    Cluster 95    
    jewish:0.219
    rabbi:0.172
    israel:0.041
    yeshiva:0.037
    hebrew:0.034
    
    ==========================================================
    Cluster 96    
    film:0.045
    producer:0.038
    television:0.037
    series:0.033
    directed:0.031
    
    ==========================================================
    Cluster 97    
    chess:0.416
    grandmaster:0.085
    olympiad:0.066
    championship:0.064
    fide:0.059
    
    ==========================================================
    Cluster 98    
    opera:0.227
    ballet:0.088
    she:0.065
    la:0.036
    her:0.034
    
    ==========================================================
    Cluster 99    
    art:0.208
    museum:0.149
    gallery:0.085
    arts:0.043
    contemporary:0.041
    
    ==========================================================
    
    In [85]:
    sum(np.bincount(cluster_assignment[k]) < 236) / k
    
    Out[85]:
    0.33
    In [81]:
    k
    
    Out[81]:
    100
    In [86]:
    sum(np.bincount(cluster_assignment[k]) < 236) 
    
    Out[86]:
    33
    In [ ]:
     
    
    In [ ]:
    !unzip images.sf.zip
    
    Archive:  images.sf.zip
    replace images.sf/m_eb749d4e5da0750d.frame_idx? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
    In [2]:
    from __future__ import print_function # to conform python 2.x print to python 3.x
    import turicreate
    import numpy as np
    import matplotlib.pyplot as plt 
    import copy
    from scipy.stats import multivariate_normal
    
    %matplotlib inline
    
    In [3]:
    def log_sum_exp(Z):
        """ Compute log(\sum_i exp(Z_i)) for some array Z."""
        return np.max(Z) + np.log(np.sum(np.exp(Z - np.max(Z))))
    
    def loglikelihood(data, weights, means, covs):
        """ Compute the loglikelihood of the data for a Gaussian mixture model with the given parameters. """
        num_clusters = len(means)
        num_dim = len(data[0])
        
        ll = 0
        for d in data:
            
            Z = np.zeros(num_clusters)
            for k in range(num_clusters):
                
                # Compute (x-mu)^T * Sigma^{-1} * (x-mu)
                delta = np.array(d) - means[k]
                exponent_term = np.dot(delta.T, np.dot(np.linalg.inv(covs[k]), delta))
                
                # Compute loglikelihood contribution for this data point and this cluster
                Z[k] += np.log(weights[k])
                Z[k] -= 1/2. * (num_dim * np.log(2*np.pi) + np.log(np.linalg.det(covs[k])) + exponent_term)
                
            # Increment loglikelihood contribution of this data point across all clusters
            ll += log_sum_exp(Z)
            
        return ll
    
    In [4]:
    def compute_responsibilities(data, weights, means, covariances):
        '''E-step: compute responsibilities, given the current parameters'''
        num_data = len(data)
        num_clusters = len(means)
        resp = np.zeros((num_data, num_clusters))
        
        # Update resp matrix so that resp[i,k] is the responsibility of cluster k for data point i.
        # Hint: To compute likelihood of seeing data point i given cluster k, use multivariate_normal.pdf.
        for i in range(num_data):
            for k in range(num_clusters):
                # YOUR CODE HERE
                resp[i, k] = ...
    
        # Add up responsibilities over each data point and normalize
        row_sums = resp.sum(axis=1)[:, np.newaxis]
        resp = resp / row_sums
    
        return resp
    
    In [5]:
    images = turicreate.SFrame('images.sf')
    import array
    images['rgb'] = images.pack_columns(['red', 'green', 'blue'])['X4']
    
    # The result will pop out in a separate window
    images.explore()
    

    path image folder red green blue rgb
    0 /data/coursera/images/sunsets/ANd9GcSN4TPL6_XoTvZeg3-15UhGnWAwjhbxQLjTNiCpWIqMyzq27xIdlg.jpg sunsets 0.403223 0.254570 0.297392 [0.4032234842076981, 0.2545700932661306, 0.29739183218564663]
    1 /data/coursera/images/sunsets/ANd9GcQeme67tTCcvFbjg3xtvKPls3300iLBXVDEUfy8mx7yWaCAIqEWAw.jpg sunsets 0.556835 0.246377 0.039541 [0.5568352144946861, 0.24637719918003423, 0.03954077190224097]
    2 /data/coursera/images/sunsets/ANd9GcSAb2GMlHYIvV8eXZuUskgqHA-Oo2LfLjw3FsyeSDF0-5z1rzyk.jpg sunsets 0.530449 0.114891 0.072064 [0.5304492672627918, 0.11489094324391455, 0.07206392821659215]
    3 /data/coursera/images/sunsets/ANd9GcRs4-CSokZQJFe9vodGC6fK7ouonFopisgxltdHeLmmR85ny4hA.jpg sunsets 0.457634 0.194312 0.616880 [0.4576339546457987, 0.19431157564148255, 0.6168796125118793]
    4 /data/coursera/images/sunsets/ANd9GcQov2JpsVI0UO50J4NiyhsuxMNw90ffXe6U7PCHVJeQFIUKLZ4qdA.jpg sunsets 0.329559 0.231696 0.202149 [0.32955887691971036, 0.23169615548161854, 0.20214944771027293]
    5 /data/coursera/images/sunsets/ANd9GcRPQowxaXHVSyfep_8onh-Gjg6P1J5Pux6rEUe3Xuq25aeWwn82.jpg sunsets 0.431074 0.303505 0.270772 [0.4310738970368823, 0.30350462147887325, 0.27077178821516673]
    6 /data/coursera/images/sunsets/ANd9GcTIIpd0qDRpnGpffSCoHgc6iS4q5SgNDoXtAuXna-1WgB7qSbb-TA.jpg sunsets 0.331973 0.239819 0.153039 [0.3319727876348366, 0.23981891904330693, 0.15303946769394577]
    7 /data/coursera/images/sunsets/ANd9GcQm3Dvfoztf_kp79I9Lr487O3KOFJeo9q0E2TmbrBr9hz9MgKcB.jpg sunsets 0.417399 0.401744 0.550714 [0.4173993262543467, 0.4017440542722305, 0.5507135649528068]
    8 /data/coursera/images/sunsets/ANd9GcTCaO2qoZveB_UDbtBh8eit0Z3kThjh-sxKlVMsEM1_T58N7r5J.jpg sunsets 0.578920 0.314133 0.108881 [0.578919523099851, 0.31413344510680574, 0.10888055762543468]
    9 /data/coursera/images/sunsets/ANd9GcTZ_TW2uDnyBfL_9gkss-9BwWnM82dbFCi_omn-SeA7NkNYcxX6mQ.jpg sunsets 0.318551 0.319905 0.334770 [0.3185505930203676, 0.31990507016890213, 0.33476954483358173]
    10 /data/coursera/images/sunsets/ANd9GcQBmz9VBm4Okn1pjl7lqfdP47hq0PxOHtnNtjhFOhSoE0doLQzW.jpg sunsets 0.575210 0.289107 0.377153 [0.5752103484623016, 0.289106677827381, 0.377153087797619]
    11 /data/coursera/images/sunsets/ANd9GcSL1kq5zUHjlrKgr_MU1kVm4k3huzMPR3ifCMgWlswTQIBGPJPOXg.jpg sunsets 0.532248 0.468598 0.389243 [0.5322484320665674, 0.46859809674615005, 0.38924265710382516]
    12 /data/coursera/images/sunsets/ANd9GcSO1N3EzBmkWq-WzZzT1ZhlKwnrhAbcLtwDUp3BLVMiQqVRDfSp9Q.jpg sunsets 0.465509 0.206380 0.083270 [0.46550943809869233, 0.20637953552903535, 0.08326970157299575]
    13 /data/coursera/images/sunsets/ANd9GcTfpwOtPZJ8aqbHY1m91H0gSnPB2GZROK4OW5ilTX148q4Ya-4LyA.jpg sunsets 0.745278 0.257555 0.083041 [0.7452783088454802, 0.25755532778728657, 0.0830412806492059]
    14 /data/coursera/images/sunsets/ANd9GcQCV8_89q_UtP6v7HpojrNvIlDKIFcGVltUayatuPhAAal19B7t.jpg sunsets 0.681248 0.533516 0.325738 [0.6812476366277913, 0.5335155721350954, 0.32573776084165906]
    15 /data/coursera/images/sunsets/ANd9GcSn7xh8NQ95nLShi_wkshZY2xWcuXdx57sDXnCWIcUrNOAp2U2Isg.jpg sunsets 0.411161 0.504357 0.578829 [0.41116110181051585, 0.5043566313244048, 0.5788290550595238]
    16 /data/coursera/images/sunsets/ANd9GcRHGdeJnvFvnXkH2PmzkbDzoHzdSbMuO2ks8nqmjtJVCCCF9Xl1FA.jpg sunsets 0.439749 0.519179 0.618196 [0.43974897540983604, 0.5191793964232488, 0.6181964108296075]
    17 /data/coursera/images/sunsets/ANd9GcT8JlzLyo03Pp97H2B2qQMiFVGQ3tNCNxK3Wq28cUZvMx8A9sCA.jpg sunsets 0.553954 0.355582 0.320426 [0.553953717251951, 0.35558197463768115, 0.3204255691093327]
    18 /data/coursera/images/sunsets/ANd9GcSzeXthxtaN9plSJFOKddJQI5E4WB4Rzzsu-BD0CY5DdBwh96ngQg.jpg sunsets 0.743875 0.432845 0.450786 [0.7438753570541481, 0.4328445572528564, 0.45078598484848487]
    19 /data/coursera/images/sunsets/ANd9GcRdsHi_PwhUF3-9lVHHCy5COZoLwJAOYU80b1FWhC6IkwDNMbNB0A.jpg sunsets 0.443751 0.280791 0.291783 [0.44375077620466963, 0.28079071969696967, 0.29178294212617983]
    20 /data/coursera/images/sunsets/ANd9GcSHnEuQVIA8JxsVb_d09QJOt3bKqK75jZFyJVjp_RNi0YOx8Wv_og.jpg sunsets 0.418087 0.265995 0.210825 [0.41808657786885245, 0.26599478390461995, 0.21082495032290113]
    21 /data/coursera/images/sunsets/ANd9GcRn1a0FRSODbYAlzwMTd0VfIzbftWztX1-M38i-lwZ3hK4E2AYJ-Q.jpg sunsets 0.289492 0.141854 0.014024 [0.2894921103013834, 0.14185438796936758, 0.01402436388339921]
    22 /data/coursera/images/sunsets/ANd9GcRZY5x07-GUlyC9JOYmwnkwXa4LbP9jKiquDZX-7eNfCgQPusNfUg.jpg sunsets 0.606827 0.413317 0.414274 [0.60682666015625, 0.4133173828125, 0.414274169921875]
    23 /data/coursera/images/sunsets/ANd9GcQY71eLp6Z3qb-m8QiW-urhxVeKhGpC_JFqgWDwtuM27BeMyW-w.jpg sunsets 0.659578 0.451834 0.480988 [0.6595779854910714, 0.4518341548859127, 0.480988033234127]
    24 /data/coursera/images/sunsets/ANd9GcQYXnPhvKwOXyzG0U0XULdrnt0enuJADhuDuo1UWMFA2jV4lUu6.jpg sunsets 0.476831 0.267148 0.357970 [0.47683067871336315, 0.2671477583209141, 0.3579699919274714]
    25 /data/coursera/images/sunsets/ANd9GcQj3sU9Gf4MfBQaOnrpAFByyMZqWuhONdRfC-TNPTofV8S21BgtYA.jpg sunsets 0.672485 0.480775 0.136721 [0.6724851377110208, 0.4807750630822981, 0.13672116068940118]
    26 /data/coursera/images/sunsets/ANd9GcQQu3ojtg_WSsdPM6X4iNaoejKW58tJmz8gEwBegOmmTAwQ8CXmaA.jpg sunsets 0.465263 0.240950 0.081812 [0.4652632843501984, 0.24094974578373016, 0.08181206597222222]
    27 /data/coursera/images/sunsets/ANd9GcRiVGX2-ar-hH2D8GAXCxLT_uW2pQSTLut9T-sF9ITJ6y9u3S0Vcw.jpg sunsets 0.381877 0.444458 0.453180 [0.3818773561211904, 0.44445762898364766, 0.45318032843956396]
    28 /data/coursera/images/sunsets/ANd9GcQtFjZzNBLOpzTLyt7eWW7hTIvZQni8UbHZfJjiRXohldnMmgsN.jpg sunsets 0.549877 0.391405 0.446488 [0.5498773596621958, 0.3914049304520616, 0.4464880619721808]
    29 /data/coursera/images/sunsets/ANd9GcS6UV30PMi6vrUMsUt80oW_24U0V3ss_z-qeHj1G4RDjEOntSwW.jpg sunsets 0.722454 0.200460 0.303308 [0.7224544304521753, 0.20045983137961232, 0.3033080369581658]
    30 /data/coursera/images/sunsets/ANd9GcTi_xqhKxHngomZxpQOSqzLUMhy1S1qfPEWTcbN3aXL7HK7SMQNrw.jpg sunsets 0.773048 0.570237 0.313390 [0.7730481557377049, 0.5702365095628416, 0.3133903843765524]
    31 /data/coursera/images/sunsets/ANd9GcR_09vZLKNYUkOFtc1y_rkzMOpO0e6VS0QaSqDNnz0O1i5QVi3t.jpg sunsets 0.695401 0.346390 0.075317 [0.6954009331597222, 0.3463899739583333, 0.07531722780257936]
    32 /data/coursera/images/sunsets/ANd9GcRdZd6pv8D7DBPRZAO9tXy-HAyGSKcZg6Ebbs7l3i5xCfgu-YSC.jpg sunsets 0.495876 0.155497 0.065678 [0.4958762436046743, 0.15549716502926422, 0.06567814403666986]
    33 /data/coursera/images/sunsets/ANd9GcR04DFOem_nOZbkjwX_3BDAXtiPmF3mag5mwdlwAnZH5ceFmGoh.jpg sunsets 0.400289 0.356060 0.357375 [0.40028882575757574, 0.35605975223546943, 0.35737549677098857]
    34 /data/coursera/images/sunsets/ANd9GcQMLiOIwamPqz4FKFXyUH_hf2bstJvVft__uyvyFNzb5b9lyA5A.jpg sunsets 0.542837 0.356715 0.318294 [0.5428369140625, 0.356715339781746, 0.3182941158234127]
    35 /data/coursera/images/sunsets/ANd9GcSDEFQ_-wH3RCUKLylXnAIruTB9FArzxn68LD0_yuTQApnNBC20.jpg sunsets 0.619282 0.571417 0.569110 [0.6192816225782414, 0.5714173497267759, 0.5691100037257825]
    36 /data/coursera/images/sunsets/ANd9GcScNyrHeWLvH7am0TpltiEEJ9CosecsRC1wSlVBTqz9iswJOLjJDw.jpg sunsets 0.515521 0.522777 0.534728 [0.5155214542970691, 0.5227773379284649, 0.5347281731246896]
    37 /data/coursera/images/sunsets/ANd9GcTWXsYFNN0fYA2QetcrF1Irx330m11oMNQOS8Qai34ArGDUROO0.jpg sunsets 0.442885 0.157788 0.054802 [0.4428845021081349, 0.1577876984126984, 0.05480174231150794]
    38 /data/coursera/images/sunsets/ANd9GcSe81J9HiRw_02Hlgf3Wg5cjEPKPJLC7zYU5ofEwPkuJ4oqRRS9.jpg sunsets 0.412819 0.260080 0.291720 [0.41281893115401447, 0.2600799558118485, 0.29171950843478955]
    39 /data/coursera/images/sunsets/ANd9GcSdNeOBQTNbqPuEk5U-TVs8Dln66ujCasf11N1ieUHiEPV2TRVBpw.jpg sunsets 0.508744 0.310266 0.437166 [0.5087436351217088, 0.3102658500993542, 0.4371663096125186]
    40 /data/coursera/images/sunsets/ANd9GcS59y5m6MmNqym00PBKNjDPqmpMy8obOLBuOcUAVTanZZOJxt9W.jpg sunsets 0.509290 0.202590 0.333762 [0.5092896174863388, 0.20259035022354693, 0.3337617206905117]
    41 /data/coursera/images/sunsets/ANd9GcScj81Z5hZTG1IOjN6e7HjbAm8l0y8s8biNFDUdn2cwthi9UBNl1Q.jpg sunsets 0.591565 0.187007 0.042252 [0.5915654591390149, 0.18700748363770192, 0.042252044665592424]
    42 /data/coursera/images/sunsets/ANd9GcRk42vFH08TygmvldqZ1IEDPcHIxR_Ilcn88iTXx9PEhjGxfAre.jpg sunsets 0.543546 0.248770 0.255513 [0.5435462017337074, 0.24876971930452774, 0.25551326141879527]
    43 /data/coursera/images/sunsets/ANd9GcTpv1GoGEC6n0DBYdJavt99lKiEXEv9SE4DoedkF-LGBncy8v3G.jpg sunsets 0.471613 0.236832 0.208774 [0.4716129533035271, 0.2368323087431694, 0.2087737518628912]
    44 /data/coursera/images/sunsets/ANd9GcT64JvuOrtwaH6zF_kWoNhfV3lXS85A9fANobbt0p6SrDMThKFClg.jpg sunsets 0.438263 0.322950 0.249257 [0.4382634879776217, 0.32294991233682496, 0.2492565834176487]
    45 /data/coursera/images/sunsets/ANd9GcTHndsUHJHc03mGWbCfZeXjOkDIY2kLmyOSGDAQamV7aQJ-Nqy0jQ.jpg sunsets 0.585562 0.376926 0.213066 [0.5855616657459388, 0.37692598530319316, 0.21306559439202102]
    46 /data/coursera/images/sunsets/ANd9GcQWa02JP_-LQNysjkhufzln-sL3QPj_ka__scvNWd0ibnt1EPSb6w.jpg sunsets 0.483128 0.407718 0.355157 [0.48312779433681075, 0.4077181135121709, 0.3551571814456036]
    47 /data/coursera/images/sunsets/ANd9GcSOBwVrU-cVKExiQEgNWziiFSF5MU8MsWXdcC_jAgTpwsBiRpWU.jpg sunsets 0.364768 0.101154 0.115004 [0.36476823092883015, 0.10115419635393862, 0.11500386846714167]
    48 /data/coursera/images/sunsets/ANd9GcQk-WeL6N3tb6a79QiUQs-dDp-PI_Ye7dHviKjJAoj-IvN69rQiDg.jpg sunsets 0.252793 0.281998 0.277426 [0.25279276515377574, 0.2819982549783162, 0.2774258101575555]
    49 /data/coursera/images/sunsets/ANd9GcRuM9jj0NZmoAvB5jMOtCg01f-Ng27IjKjCMX_1cqa9rKk4gqOt.jpg sunsets 0.413875 0.441579 0.435464 [0.4138746741034112, 0.44157938256279106, 0.43546423471520124]
    50 /data/coursera/images/sunsets/ANd9GcR2xvU0J9rTDVrcFAgtfar2iSk0ENuCgsH7KRmN9Uf6BR-S-GZujw.jpg sunsets 0.443760 0.267036 0.073878 [0.44376001304023843, 0.2670364505712866, 0.07387815139095877]
    51 /data/coursera/images/sunsets/ANd9GcRctsS1SN9Kvcp1j5iMjih4jv9JrbL5xCHMSWSW0na76Q7bQovw.jpg sunsets 0.452440 0.310596 0.257824 [0.45243993302949487, 0.3105960082643395, 0.2578241613760299]
    52 /data/coursera/images/sunsets/ANd9GcT0shDIioVJO9-VH7VvwB-h6raUNYztF8e6BTmCuVG-v2MvEsz_Xg.jpg sunsets 0.579431 0.378367 0.251075 [0.5794314236111111, 0.3783672030009921, 0.25107530381944443]
    53 /data/coursera/images/sunsets/ANd9GcRRf0d8e7GZfwKAl8BiMqwj74Z24gerEfmTML_Kz8s2_PyMMGXADg.jpg sunsets 0.466466 0.160419 0.340548 [0.4664658625186289, 0.16041860717834078, 0.3405477676353701]
    54 /data/coursera/images/sunsets/ANd9GcTohz4jLZMHwLOnwnUFl5wLgEPd89_DincOcBDGmKMNx8Und-JTgQ.jpg sunsets 0.514728 0.514224 0.541076 [0.5147275555357567, 0.5142236130909955, 0.5410763362982102]
    55 /data/coursera/images/sunsets/ANd9GcSfgH0w3HIR0J7yhFdTrWai5f3K2dM7gJJWp79vl6gylwCwXYZXng.jpg sunsets 0.384980 0.363514 0.285966 [0.384980101320447, 0.3635140589577616, 0.2859663816718371]
    56 /data/coursera/images/sunsets/ANd9GcS97_PhVMp6bDW3-VBEv0_dRP1804uXcZa4oPGmc2MGHKO23DkKAg.jpg sunsets 0.671842 0.565891 0.583955 [0.6718417796467703, 0.5658906392152209, 0.5839545539387286]
    57 /data/coursera/images/sunsets/ANd9GcSlo6r4Qa6Xnfj6t0gXLugxCbGQEz8dhq8kWFchwLZTlrR_JBJGFQ.jpg sunsets 0.630192 0.382246 0.163806 [0.6301920572916667, 0.3822457837301587, 0.16380564856150795]
    58 /data/coursera/images/sunsets/ANd9GcS3jzoIsv2p_HUHs73YnXUPLV7Mt2TPFz1y0sMga1R78Do0300w.jpg sunsets 0.443019 0.272771 0.080626 [0.44301920983837945, 0.2727710698169095, 0.08062640399180963]
    59 /data/coursera/images/sunsets/ANd9GcTIozvbWNOTPiQ6c4VRIZ7t9u-I9EXD1l1eAAaO1iCTlG4nrdIBlA.jpg sunsets 0.377749 0.126360 0.092872 [0.37774948708620415, 0.12636039711530847, 0.09287204502257068]
    60 /data/coursera/images/sunsets/ANd9GcTk_Y-Gwpbse6DT0TDV4dYWLZaybJkGT8F8az_xmuZ-Vc83Y6Oz4A.jpg sunsets 0.491294 0.335929 0.239821 [0.4912938524508634, 0.3359291851923689, 0.23982096587292115]
    61 /data/coursera/images/sunsets/ANd9GcQZ6ZKxfRN7ixcj1cwdMHJP1cSKmtpa090W8cikckkYiZAeFl8Q.jpg sunsets 0.379466 0.200094 0.238525 [0.3794658358134921, 0.20009362599206348, 0.2385254681299603]
    62 /data/coursera/images/sunsets/ANd9GcRScYWqbMMB9bj6yLEDc9sVBhKuLpkGX6XDlXUIDvVKeARTMG5V.jpg sunsets 0.623014 0.370704 0.110402 [0.6230137698708396, 0.37070417287630403, 0.11040246212121212]
    63 /data/coursera/images/sunsets/ANd9GcQa35xTzYNjAQNDmtLwFeDlzmWaIo8FCdO2s5vDLJg0UEMyb7DN.jpg sunsets 0.335102 0.371289 0.485207 [0.3351016739744352, 0.3712885980231867, 0.48520672314704716]
    64 /data/coursera/images/sunsets/ANd9GcT1lcYtmDTsCb-WwpofjsIne3AWkAd8WDz1328UOGizlxjdKIFq.jpg sunsets 0.309545 0.378142 0.335398 [0.3095446783407849, 0.3781423093641331, 0.33539842585692997]
    65 /data/coursera/images/sunsets/ANd9GcSKElB-04bh4dpM1nCtbyvAfZIeEvdx0QJWr0eQ-zOG5Jxopx-d.jpg sunsets 0.585963 0.267481 0.175817 [0.5859629995422522, 0.2674805581538829, 0.17581724478067906]
    66 /data/coursera/images/sunsets/ANd9GcSgx8rNDq8i6vN1LZJFdDBumHD6XgQyyZOhzCr2cl9Rcq59bZPd.jpg sunsets 0.286388 0.279029 0.330032 [0.2863882420516642, 0.27902857985593643, 0.3300316691505216]
    67 /data/coursera/images/sunsets/ANd9GcQaMTGizr08BdDihGFW22sxHyZUFZazT0-DhlYUjZLceKYUv87Jvw.jpg sunsets 0.464538 0.199905 0.409143 [0.4645377526383911, 0.19990510503783354, 0.4091432851702509]
    68 /data/coursera/images/sunsets/ANd9GcT23Os5u_Nm5DMZ6yW0YsYDIMsx720OrxuzF_oGvPbGRo3nXc1u.jpg sunsets 0.294156 0.215644 0.214431 [0.29415622585581835, 0.21564414509786442, 0.2144307452008796]
    69 /data/coursera/images/sunsets/ANd9GcTHq57PL5KvIcXl6_vjFHO4s-Vyh6rIPNMx9gZxdz1aZleh6Ssa.jpg sunsets 0.415247 0.535832 0.502496 [0.41524677579365077, 0.5358323257688492, 0.5024958922371032]
    70 /data/coursera/images/sunsets/ANd9GcRBaacedfTfxlzjyPY2HK0HOJKXVMWRMzlbYt1QcyI1xoWXkpXeMQ.jpg sunsets 0.531277 0.425165 0.297173 [0.5312770492311508, 0.4251651630704365, 0.29717347160218255]
    71 /data/coursera/images/sunsets/ANd9GcSYhWwY-sXem8YVU3PknNC4UCRRMuLmgOlfSptznw2Hk9CY1AOT.jpg sunsets 0.621522 0.450985 0.521145 [0.6215222925981123, 0.4509850813462494, 0.5211454452309985]
    72 /data/coursera/images/sunsets/ANd9GcQkM2nQvq8FPmcA-T6DchVkGohw-sp6KbqMaTtFds61p_0-rLgzQQ.jpg sunsets 0.290299 0.184462 0.207466 [0.2902993936975375, 0.1844621163523889, 0.20746628786748617]
    73 /data/coursera/images/sunsets/ANd9GcR-YX1KgguWCg0MNxzSDxogqGBxOVyTFN2SseQWX-8gK62MN83xVg.jpg sunsets 0.554009 0.295853 0.196315 [0.5540094669408806, 0.29585288469153403, 0.19631504848241862]
    74 /data/coursera/images/sunsets/ANd9GcTX5X9PZ2f1sDudW9vPLXoKL5ckSNrJS7fu0uui1Y9EhA7v-vC2.jpg sunsets 0.435353 0.309032 0.278687 [0.4353533451140873, 0.30903173053075395, 0.27868714347718254]
    75 /data/coursera/images/sunsets/ANd9GcQV2VuOnacUOwUZ8vqvSL4uaLLntgTRqnPo8bjV-npkgItZXaRK.jpg sunsets 0.402675 0.356356 0.346943 [0.4026751277361379, 0.3563563328793813, 0.3469434811079743]
    76 /data/coursera/images/sunsets/ANd9GcQPtkSzAaqQJIoHlwwmjZeFef_RukkLkxfaJzTkNtMXfuYYU3hl.jpg sunsets 0.416591 0.188316 0.101859 [0.4165913748137109, 0.18831610158966716, 0.10185862208147044]
    77 /data/coursera/images/sunsets/ANd9GcRhL1IJeKEURBZpbP2shUQ41xrIYryPtDFsg2yKno4FawMshPGRkQ.jpg sunsets 0.608117 0.426154 0.395832 [0.6081169274714356, 0.4261540611028316, 0.3958324795081967]
    78 /data/coursera/images/sunsets/ANd9GcRhF0P5E1Y-F8FHhHkHJa0ec1y4fa78ywBE7Mu68NJWXMP9RbqXRg.jpg sunsets 0.531097 0.216416 0.095466 [0.5310973576288333, 0.21641643132364025, 0.095465948198913]
    79 /data/coursera/images/sunsets/ANd9GcTPV9Ybox40SXw4rpACDUZ1EBynvRVJnunalJNcDijCI0FavvPy.jpg sunsets 0.408670 0.375327 0.345399 [0.4086701242387454, 0.3753272182113999, 0.34539860760060503]
    80 /data/coursera/images/sunsets/ANd9GcSzRSJPYL2agiGFcgon1F-ONbwMbCQh2RtxPODhDybW7hwSAxJA.jpg sunsets 0.197723 0.182190 0.178018 [0.19772340682629388, 0.18219019556811422, 0.17801810678167757]
    81 /data/coursera/images/sunsets/ANd9GcQyjcRGJits5hRypGoCs9BLBScDCm34YsPINciwlNnDod7lWuSJ.jpg sunsets 0.376630 0.168168 0.465189 [0.3766297022956951, 0.16816783913225114, 0.46518939733030956]
    82 /data/coursera/images/sunsets/ANd9GcRk05mYkqARwt87BQNGJ20BNbd4fb_eaAFd1_hval70sa7P_KomMQ.jpg sunsets 0.300981 0.135470 0.257158 [0.30098057935916545, 0.13547014716840536, 0.25715831470442124]
    83 /data/coursera/images/sunsets/ANd9GcQXlNshRcQ7WX_PrDXQw9esYY8MKy7Wmacg39HEEDp4N2FSsUWw.jpg sunsets 0.648614 0.450321 0.301235 [0.6486141338045635, 0.45032149057539683, 0.3012353515625]
    84 /data/coursera/images/sunsets/ANd9GcQ-6pQo9fg5S5QMcV8N8J54yYqrubYCWw-bq4-Da1Ku1hkEJEO3.jpg sunsets 0.827147 0.653187 0.432993 [0.8271469821162444, 0.6531869411326379, 0.43299304520615994]
    85 /data/coursera/images/sunsets/ANd9GcQgCTRJROBCiF-kcOxAmwiKnb0UkY3aqtQOCPvuvoLOVwKXZqkw9A.jpg sunsets 0.383679 0.483137 0.619628 [0.3836793954670547, 0.48313703089114596, 0.6196284035449829]
    86 /data/coursera/images/sunsets/ANd9GcQVSV4PiEMenWys7kH1QVPmpbV43zbfTdu9-hWbVfpeAxbq8XVQEw.jpg sunsets 0.241059 0.144941 0.100495 [0.24105870863970588, 0.14494128204491255, 0.10049464676073132]
    87 /data/coursera/images/sunsets/ANd9GcR1ZoGvVPetGeDhINF9g8ycjdvPYWiGSduJNTFLakwsET9Y9CJR.jpg sunsets 0.461697 0.233015 0.056374 [0.46169728112599207, 0.23301478794642858, 0.05637369791666667]
    88 /data/coursera/images/sunsets/ANd9GcS5RMMQsu9ognNCUVV-eROtdtRFWB_2r0NcIS5Z9Qee7v9dha2n.jpg sunsets 0.603878 0.461333 0.384608 [0.603877636304115, 0.4613328371145986, 0.3846076761586842]
    89 /data/coursera/images/sunsets/ANd9GcRmGhuHn7UQfIpL0Q8mdNb_L1pNH1NgSIKnT-3ypcTGvAfvlB7C.jpg sunsets 0.472948 0.096230 0.102576 [0.4729475109300477, 0.09623029796800477, 0.10257640165441176]
    90 /data/coursera/images/sunsets/ANd9GcQBQJncxQKcm335M26pQvhSqS9tPq9BMQFzrn8tOXKhpOrqYi071w.jpg sunsets 0.371514 0.412192 0.477004 [0.3715135214853453, 0.41219184674615, 0.47700431569796325]
    91 /data/coursera/images/sunsets/ANd9GcSB33ZBKEQhT7tZCua7rhuHepu2i0tgPIKYpDsuF8rJOHs-9b4MBA.jpg sunsets 0.459873 0.198164 0.530288 [0.4598728160808069, 0.19816370193264105, 0.5302883575286464]
    92 /data/coursera/images/sunsets/ANd9GcQtEAQ5WEY6aiaaPCPHFZk4z0NGd6oZ5Te_nZHDckoqsLlr4QKz.jpg sunsets 0.527999 0.388483 0.309215 [0.5279985812891027, 0.388483089931298, 0.3092145520264117]
    93 /data/coursera/images/sunsets/ANd9GcTAC2Eb-Aq_weA9-P17L26UZgQ0wZoatyJFQIDqKNlSN3Zph9jG.jpg sunsets 0.628384 0.386768 0.207165 [0.6283844375397868, 0.38676809488640884, 0.20716475317299277]
    94 /data/coursera/images/sunsets/ANd9GcRkvq_jSrRFA3ZIDA2Nd_zP9VgetqlJyACwdyF8UdIVRmoPTreFZg.jpg sunsets 0.560472 0.196484 0.035950 [0.5604716590750697, 0.19648390830346477, 0.03595010080645161]
    95 /data/coursera/images/sunsets/ANd9GcRHWdFeoQcQT5A4iWGdiNyKw9_WLEn6Yvjz0yjh2-4tuw6Yd9Bh.jpg sunsets 0.652187 0.355665 0.144637 [0.6521867101361303, 0.35566536079986466, 0.14463682432432431]
    96 /data/coursera/images/sunsets/ANd9GcSo6l5a9PEDX1fWvW1TkwLXT6RCJMKaUCv8i-r-UAAyr06WHeWsjA.jpg sunsets 0.453424 0.249679 0.444943 [0.4534243143066173, 0.24967914501712057, 0.44494280911570316]
    97 /data/coursera/images/sunsets/ANd9GcSRb3OXTt69DIbG0ZDr9WhjwE9LBrZsgE5eqcOzlzJ0-0zDoQpw.jpg sunsets 0.194003 0.198174 0.418481 [0.19400325298422977, 0.1981743006177008, 0.41848088864006516]
    98 /data/coursera/images/sunsets/ANd9GcRZ6t5IILMj0_bEoTaT8TOnR-0nRMvsdBab7KjEIqwokBxRH23BJQ.jpg sunsets 0.512699 0.322139 0.394569 [0.5126989552331349, 0.32213898189484125, 0.39456876240079364]
    99 /data/coursera/images/sunsets/ANd9GcRD94TAbd4Pw_uLY2nleDbKnjOAoRrDpt5VxTma-XF0nZcykJJ5pg.jpg sunsets 0.604095 0.414971 0.371276 [0.6040952731311945, 0.41497092119273177, 0.3712759784360944]
    In [6]:
    def log_sum_exp(Z):
        """ Compute log(\sum_i exp(Z_i)) for some array Z."""
        return np.max(Z) + np.log(np.sum(np.exp(Z - np.max(Z))))
    
    def loglikelihood(data, weights, means, covs):
        """ Compute the loglikelihood of the data for a Gaussian mixture model with the given parameters. """
        num_clusters = len(means)
        num_dim = len(data[0])
        
        ll = 0
        for d in data:
            
            Z = np.zeros(num_clusters)
            for k in range(num_clusters):
                
                # Compute (x-mu)^T * Sigma^{-1} * (x-mu)
                delta = np.array(d) - means[k]
                exponent_term = np.dot(delta.T, np.dot(np.linalg.inv(covs[k]), delta))
                
                # Compute loglikelihood contribution for this data point and this cluster
                Z[k] += np.log(weights[k])
                Z[k] -= 1/2. * (num_dim * np.log(2*np.pi) + np.log(np.linalg.det(covs[k])) + exponent_term)
                
            # Increment loglikelihood contribution of this data point across all clusters
            ll += log_sum_exp(Z)
            
        return ll
    
    In [7]:
    import numpy as np
    from scipy.stats import multivariate_normal
    
    def compute_responsibilities(data, weights, means, covariances):
        '''E-step: compute responsibilities, given the current parameters'''
        num_data = len(data)
        num_clusters = len(means)
        resp = np.zeros((num_data, num_clusters))
        
        # Update resp matrix so that resp[i,k] is the responsibility of cluster k for data point i.
        # Hint: To compute likelihood of seeing data point i given cluster k, use multivariate_normal.pdf.
        for i in range(num_data):
            for k in range(num_clusters):
                # YOUR CODE HERE
                resp[i, k] = weights[k]*multivariate_normal.pdf(data[i], means[k], covariances[k])
    
        # Add up responsibilities over each data point and normalize
        row_sums = resp.sum(axis=1)[:, np.newaxis]
        resp = resp / row_sums
    
        return resp
    
    In [8]:
    def compute_soft_counts(resp):
        # Compute the total responsibility assigned to each cluster, which will be useful when 
        # implementing M-steps below. In the lectures this is called N^{soft}
        counts = np.sum(resp, axis=0)
        return counts
    
    In [9]:
    def compute_weights(counts):
        num_clusters = len(counts)
        #weights = [0.] * num_clusters
        
        #for k in range(num_clusters):
            # Update the weight for cluster k using the M-step update rule for the cluster weight, \hat{\pi}_k.
            # HINT: compute # of data points by summing soft counts.
            # YOUR CODE HERE
            #weights[k] = compute_soft_counts(resp[:,k]) / len(resp)
        weights = counts / sum(counts)
        
        return weights
    
    In [10]:
    def compute_means(data, resp, counts):
        num_clusters = len(counts)
        num_data = len(data)
        means = [np.zeros(len(data[0]))] * num_clusters
        
        for k in range(num_clusters):
            # Update means for cluster k using the M-step update rule for the mean variables.
            # This will assign the variable means[k] to be our estimate for \hat{\mu}_k.
            weighted_sum = 0.
            for i in range(num_data):
                # YOUR CODE HERE
                weighted_sum += np.sum(resp[i,k], axis=0) * data[i]
            # YOUR CODE HERE
            means[k] = weighted_sum / compute_soft_counts(resp[:,k])
    
        return means
    
    In [11]:
    def compute_covariances(data, resp, counts, means):
        num_clusters = len(counts)
        num_dim = len(data[0])
        num_data = len(data)
        covariances = [np.zeros((num_dim,num_dim))] * num_clusters
        
        for k in range(num_clusters):
            # Update covariances for cluster k using the M-step update rule for covariance variables.
            # This will assign the variable covariances[k] to be the estimate for \hat{\Sigma}_k.
            weighted_sum = np.zeros((num_dim, num_dim))
            for i in range(num_data):
                # YOUR CODE HERE (Hint: Use np.outer on the data[i] and this cluster's mean)
                weighted_sum += np.sum(resp[i,k], axis=0) * np.outer(data[i] - means[k], data[i] - means[k])
            # YOUR CODE HERE
            covariances[k] = weighted_sum / compute_soft_counts(resp[:,k])
    
        return covariances
    
    In [12]:
    # SOLUTION
    def EM(data, init_means, init_covariances, init_weights, maxiter=1000, thresh=1e-4):
        
        # Make copies of initial parameters, which we will update during each iteration
        means = init_means[:]
        covariances = init_covariances[:]
        weights = init_weights[:]
        
        # Infer dimensions of dataset and the number of clusters
        num_data = len(data)
        num_dim = len(data[0])
        num_clusters = len(means)
        
        # Initialize some useful variables
        resp = np.zeros((num_data, num_clusters))
        ll = loglikelihood(data, weights, means, covariances)
        ll_trace = [ll]
        
        for it in range(maxiter):
            if it % 5 == 0:
                print("Iteration %s" % it)
            
            # E-step: compute responsibilities
            resp = compute_responsibilities(data, weights, means, covariances)
    
            # M-step
            # Compute the total responsibility assigned to each cluster, which will be useful when 
            # implementing M-steps below. In the lectures this is called N^{soft}
            counts = compute_soft_counts(resp)
            
            # Update the weight for cluster k using the M-step update rule for the cluster weight, \hat{\pi}_k.
            # YOUR CODE HERE
            weights = compute_weights(counts)
            
            # Update means for cluster k using the M-step update rule for the mean variables.
            # This will assign the variable means[k] to be our estimate for \hat{\mu}_k.
            # YOUR CODE HERE
            means = compute_means(data, resp, counts)
            
            # Update covariances for cluster k using the M-step update rule for covariance variables.
            # This will assign the variable covariances[k] to be the estimate for \hat{\Sigma}_k.
            # YOUR CODE HERE
            covariances = compute_covariances(data, resp, counts, means)
            
            # Compute the loglikelihood at this iteration
            # YOUR CODE HERE
            ll_latest = loglikelihood(data, weights, means, covariances)
            ll_trace.append(ll_latest)
            
            # Check for convergence in log-likelihood and store
            if (ll_latest - ll) < thresh and ll_latest > -np.inf:
                break
            ll = ll_latest
        
        if it % 5 != 0:
            print("Iteration %s" % it)
        
        out = {'weights': weights, 'means': means, 'covs': covariances, 'loglik': ll_trace, 'resp': resp}
    
        return out
    
    In [13]:
    np.random.seed(1)
    
    # Initalize parameters
    init_means = [images['rgb'][x] for x in np.random.choice(len(images), 4, replace=False)]
    cov = np.diag([images['red'].var(), images['green'].var(), images['blue'].var()])
    init_covariances = [cov, cov, cov, cov]
    init_weights = [1/4., 1/4., 1/4., 1/4.]
    
    # Convert rgb data to numpy arrays
    img_data = [np.array(i) for i in images['rgb']]  
    
    # Run our EM algorithm on the image data using the above initializations. 
    # This should converge in about 125 iterations
    out = EM(img_data, init_means, init_covariances, init_weights)
    
    Iteration 0
    Iteration 5
    Iteration 10
    Iteration 15
    Iteration 20
    Iteration 25
    Iteration 30
    Iteration 35
    Iteration 40
    Iteration 45
    Iteration 50
    Iteration 55
    Iteration 60
    Iteration 65
    Iteration 70
    Iteration 75
    Iteration 80
    Iteration 85
    Iteration 90
    Iteration 95
    Iteration 100
    Iteration 105
    Iteration 110
    Iteration 115
    Iteration 118
    
    In [14]:
    ll = out['loglik']
    plt.plot(range(len(ll)),ll,linewidth=4)
    plt.xlabel('Iteration')
    plt.ylabel('Log-likelihood')
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
    
    In [15]:
    plt.figure()
    plt.plot(range(3,len(ll)),ll[3:],linewidth=4)
    plt.xlabel('Iteration')
    plt.ylabel('Log-likelihood')
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
    
    In [16]:
    import colorsys
    def plot_responsibilities_in_RB(img, resp, title):
        N, K = resp.shape
        
        HSV_tuples = [(x*1.0/K, 0.5, 0.9) for x in range(K)]
        RGB_tuples = list(map(lambda x: colorsys.hsv_to_rgb(*x), HSV_tuples))
        
        R = img['red']
        B = img['blue']
        resp_by_img_int = [[resp[n][k] for k in range(K)] for n in range(N)]
        cols = [tuple(np.dot(resp_by_img_int[n], np.array(RGB_tuples))) for n in range(N)]
    
        plt.figure()
        for n in range(len(R)):
            plt.plot(R[n], B[n], 'o', c=cols[n])
        plt.title(title)
        plt.xlabel('R value')
        plt.ylabel('B value')
        plt.rcParams.update({'font.size':16})
        plt.tight_layout()
    
    In [17]:
    N, K = out['resp'].shape
    random_resp = np.random.dirichlet(np.ones(K), N)
    plot_responsibilities_in_RB(images, random_resp, 'Random responsibilities')
    
    In [18]:
    out = EM(img_data, init_means, init_covariances, init_weights, maxiter=1)
    plot_responsibilities_in_RB(images, out['resp'], 'After 1 iteration')
    
    Iteration 0
    
    In [19]:
    out = EM(img_data, init_means, init_covariances, init_weights, maxiter=20)
    plot_responsibilities_in_RB(images, out['resp'], 'After 20 iterations')
    
    Iteration 0
    Iteration 5
    Iteration 10
    Iteration 15
    Iteration 19
    
    In [20]:
    loglikelihood([img_data[0]], out['weights'], out['means'], out['covs'])
    
    Out[20]:
    1.783512009438448
    In [21]:
    for k in range(4):
        print(out['weights'][k]*multivariate_normal.pdf(img_data[0], out['means'][k], out['covs'][k]))
    
    1.4398721134198873e-08
    9.524062435886664e-10
    0.08790522926235855
    5.8628134985855125
    
    In [22]:
    weights = out['weights']
    means = out['means']
    covariances = out['covs']
    rgb = images['rgb']
    N = len(images) # number of images
    K = len(means) # number of clusters
    
    assignments = [0]*N
    probs = [0]*N
    
    for i in range(N):
        # Compute the score of data point i under each Gaussian component:
        p = np.zeros(K)
        for k in range(K):
            p[k] = weights[k]*multivariate_normal.pdf(rgb[i], mean=means[k], cov=covariances[k])
            
        # Compute assignments of each data point to a given cluster based on the above scores:
        assignments[i] = np.argmax(p)
        
        # For data point i, store the corresponding score under this cluster assignment:
        probs[i] = np.max(p)
    
    assignments = turicreate.SFrame({'assignments':assignments, 'probs':probs, 'image': images['image']})
    
    In [24]:
    def get_top_images(assignments, cluster, k=5):
        # YOUR CODE HERE
        images_in_cluster = assignments[assignments['assignments'] == cluster]
        top_images = images_in_cluster.topk('probs', k)
        return top_images['image']
    
    In [25]:
    # Images will appear in a separate window
    for component_id in range(4):
        get_top_images(assignments, component_id).explore()
    

    SArray
    0
    1
    2
    3
    4

    SArray
    0
    1
    2
    3
    4

    SArray
    0
    1
    2
    3
    4

    SArray
    0
    1
    2
    3
    4
    In [ ]:
     
    
    In [2]:
    from __future__ import print_function # to conform python 2.x print to python 3.x
    import turicreate
    
    from em_utilities import *
    
    In [3]:
    wiki = turicreate.SFrame('people_wiki.sframe/').head(5000)
    wiki['tf_idf'] = turicreate.text_analytics.tf_idf(wiki['text'])
    
    In [4]:
    wiki = wiki.add_row_number()
    tf_idf, map_word_to_index = sframe_to_scipy(wiki, 'tf_idf')
    map_index_to_word = dict([[map_word_to_index[i], i] for i in map_word_to_index.keys()])
    
    Using default 16 lambda workers.
    To maximize the degree of parallelism, add the following code to the beginning of the program:
    "turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 32)"
    Note that increasing the degree of parallelism also increases the memory footprint.
    In [5]:
    %%time
    tf_idf = normalize(tf_idf)
    
    CPU times: user 6.32 ms, sys: 1.23 ms, total: 7.55 ms
    Wall time: 7.92 ms
    
    In [6]:
    for i in range(5):
        doc = tf_idf[i]
        print(np.linalg.norm(doc.todense()))
    
    1.0
    1.0
    0.9999999999999998
    1.0000000000000007
    1.0
    
    In [58]:
    %%time 
    
    from sklearn.cluster import KMeans
    
    np.random.seed(5)
    num_clusters = 25
    
    # Use scikit-learn's k-means to simplify workflow
    #kmeans_model = KMeans(n_clusters=num_clusters, n_init=5, max_iter=400, random_state=1, n_jobs=-1) # uncomment to use parallelism -- may break on your installation
    kmeans_model = KMeans(n_clusters=num_clusters, n_init=5, max_iter=400, random_state=1, n_jobs=1)
    kmeans_model.fit(tf_idf)
    centroids, cluster_assignment = kmeans_model.cluster_centers_, kmeans_model.labels_
    
    means = [centroid for centroid in centroids]
    
    CPU times: user 2min 44s, sys: 53.8 s, total: 3min 38s
    Wall time: 2min 44s
    
    In [59]:
    %%time 
    
    num_docs = tf_idf.shape[0]
    weights = []
    for i in range(num_clusters):
        # Compute the number of data points assigned to cluster i:
        num_assigned = np.sum(cluster_assignment == i) # YOUR CODE HERE
        w = float(num_assigned) / num_docs
        weights.append(w)
    
    CPU times: user 2.25 ms, sys: 0 ns, total: 2.25 ms
    Wall time: 1.23 ms
    
    In [60]:
    #cluster_assignment
    covs = []
    for i in range(num_clusters):
        member_rows = tf_idf[cluster_assignment==i]
        cov = (member_rows.multiply(member_rows) - 2*member_rows.dot(diag(means[i]))).sum(axis=0).A1 / member_rows.shape[0] \
              + means[i]**2
        cov[cov < 1e-8] = 1e-8
        covs.append(cov)
    
    In [61]:
    out = EM_for_high_dimension(tf_idf, means, covs, weights, cov_smoothing=1e-10)
    
    In [62]:
    out['loglik']
    
    Out[62]:
    [3855847476.7012835, 4844053202.46348, 4844053202.46348]
    In [23]:
    # Fill in the blanks
    def visualize_EM_clusters(tf_idf, means, covs, map_index_to_word):
        print('')
        print('==========================================================')
        
        num_clusters = len(means)
        for c in range(num_clusters):
            print('Cluster {0:d}: Largest mean parameters in cluster '.format(c))
            print('\n{0: <12}{1: <12}{2: <12}'.format('Word', 'Mean', 'Variance'))
            
            # The k'th element of sorted_word_ids should be the index of the word 
            # that has the k'th-largest value in the cluster mean. Hint: Use np.argsort().
            sorted_word_ids = np.argsort(-means[c])  # YOUR CODE HERE
    
            for i in sorted_word_ids[:5]:
                print('{0: <12}{1:<10.2e}{2:10.2e}'.format(map_index_to_word[i], 
                                                           means[c][i],
                                                           covs[c][i]))
            print('\n==========================================================')
    
    In [24]:
    '''By EM'''
    visualize_EM_clusters(tf_idf, out['means'], out['covs'], map_index_to_word)
    
    ==========================================================
    Cluster 0: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    minister    7.57e-02    7.42e-03
    election    5.89e-02    3.21e-03
    party       5.89e-02    2.61e-03
    liberal     2.93e-02    4.55e-03
    elected     2.91e-02    8.95e-04
    
    ==========================================================
    Cluster 1: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    film        1.76e-01    6.07e-03
    films       5.50e-02    2.97e-03
    festival    4.66e-02    3.60e-03
    feature     3.69e-02    1.81e-03
    directed    3.39e-02    2.22e-03
    
    ==========================================================
    Cluster 2: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    art         1.26e-01    6.83e-03
    museum      5.62e-02    7.27e-03
    gallery     3.65e-02    3.40e-03
    artist      3.61e-02    1.44e-03
    design      3.20e-02    4.59e-03
    
    ==========================================================
    Cluster 3: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    basketball  1.86e-01    7.78e-03
    nba         1.01e-01    1.22e-02
    points      6.25e-02    5.92e-03
    coach       5.57e-02    5.91e-03
    team        4.68e-02    1.30e-03
    
    ==========================================================
    Cluster 4: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    hockey      2.45e-01    1.64e-02
    nhl         1.56e-01    1.64e-02
    ice         6.40e-02    2.97e-03
    season      5.05e-02    2.52e-03
    league      4.31e-02    1.53e-03
    
    ==========================================================
    Cluster 5: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    republican  7.93e-02    5.20e-03
    senate      5.41e-02    6.28e-03
    house       4.64e-02    2.41e-03
    district    4.60e-02    2.37e-03
    democratic  4.46e-02    3.02e-03
    
    ==========================================================
    Cluster 6: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         1.60e-01    4.65e-03
    her         1.00e-01    3.14e-03
    miss        2.22e-02    7.76e-03
    women       1.43e-02    1.36e-03
    womens      1.21e-02    1.46e-03
    
    ==========================================================
    Cluster 7: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    championships7.78e-02    5.17e-03
    m           4.70e-02    7.58e-03
    olympics    4.69e-02    2.59e-03
    medal       4.28e-02    2.44e-03
    she         4.18e-02    5.99e-03
    
    ==========================================================
    Cluster 8: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    book        1.45e-02    9.38e-04
    published   1.23e-02    6.16e-04
    that        1.10e-02    1.73e-04
    novel       1.07e-02    1.43e-03
    he          1.04e-02    6.05e-05
    
    ==========================================================
    Cluster 9: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         1.37e-01    4.25e-03
    her         8.99e-02    2.74e-03
    actress     7.65e-02    4.29e-03
    film        5.98e-02    3.44e-03
    drama       5.03e-02    6.40e-03
    
    ==========================================================
    Cluster 10: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    soccer      1.15e-01    2.86e-02
    chess       4.52e-02    1.66e-02
    team        4.13e-02    2.15e-03
    coach       3.09e-02    4.45e-03
    league      3.07e-02    2.01e-03
    
    ==========================================================
    Cluster 11: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    president   2.52e-02    1.29e-03
    chairman    2.44e-02    1.97e-03
    committee   2.34e-02    2.38e-03
    served      2.24e-02    6.99e-04
    executive   2.15e-02    1.23e-03
    
    ==========================================================
    Cluster 12: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    music       7.26e-02    3.48e-03
    jazz        6.07e-02    1.14e-02
    hong        3.78e-02    9.92e-03
    kong        3.50e-02    8.64e-03
    chinese     3.12e-02    5.33e-03
    
    ==========================================================
    Cluster 13: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    university  3.47e-02    8.89e-04
    history     3.38e-02    2.81e-03
    philosophy  2.86e-02    5.35e-03
    professor   2.74e-02    1.08e-03
    studies     2.41e-02    1.95e-03
    
    ==========================================================
    Cluster 14: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    theatre     4.93e-02    6.17e-03
    actor       3.56e-02    2.91e-03
    television  3.21e-02    1.67e-03
    film        2.93e-02    1.16e-03
    comedy      2.86e-02    3.91e-03
    
    ==========================================================
    Cluster 15: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    album       6.76e-02    4.78e-03
    band        5.35e-02    4.21e-03
    music       4.18e-02    1.96e-03
    released    3.13e-02    1.11e-03
    song        2.50e-02    1.81e-03
    
    ==========================================================
    Cluster 16: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    tour        1.14e-01    1.92e-02
    pga         1.08e-01    2.65e-02
    racing      8.45e-02    8.26e-03
    championship6.27e-02    4.54e-03
    formula     6.06e-02    1.31e-02
    
    ==========================================================
    Cluster 17: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    news        5.76e-02    8.06e-03
    radio       5.18e-02    4.62e-03
    show        3.75e-02    2.56e-03
    bbc         3.63e-02    7.41e-03
    chef        3.27e-02    1.18e-02
    
    ==========================================================
    Cluster 18: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    football    1.11e-01    5.60e-03
    yards       7.37e-02    1.72e-02
    nfl         6.98e-02    9.15e-03
    coach       6.74e-02    7.85e-03
    quarterback 4.02e-02    7.16e-03
    
    ==========================================================
    Cluster 19: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    league      5.21e-02    3.13e-03
    club        5.04e-02    2.64e-03
    season      4.77e-02    2.30e-03
    rugby       4.35e-02    8.18e-03
    cup         4.22e-02    2.46e-03
    
    ==========================================================
    Cluster 20: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    orchestra   1.31e-01    1.06e-02
    music       1.23e-01    6.15e-03
    symphony    8.70e-02    1.08e-02
    conductor   8.16e-02    1.01e-02
    philharmonic4.96e-02    3.27e-03
    
    ==========================================================
    Cluster 21: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    law         9.52e-02    8.35e-03
    court       6.84e-02    5.24e-03
    judge       4.59e-02    4.44e-03
    attorney    3.74e-02    4.30e-03
    district    3.72e-02    4.20e-03
    
    ==========================================================
    Cluster 22: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    football    1.21e-01    6.14e-03
    afl         9.58e-02    1.31e-02
    australian  7.91e-02    1.58e-03
    club        5.93e-02    1.76e-03
    season      5.58e-02    1.83e-03
    
    ==========================================================
    Cluster 23: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    research    5.70e-02    2.68e-03
    science     3.50e-02    2.95e-03
    university  3.34e-02    7.14e-04
    professor   3.20e-02    1.26e-03
    physics     2.61e-02    5.43e-03
    
    ==========================================================
    Cluster 24: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    baseball    1.16e-01    5.57e-03
    league      1.03e-01    3.63e-03
    major       5.09e-02    1.19e-03
    games       4.66e-02    1.93e-03
    sox         4.55e-02    6.28e-03
    
    ==========================================================
    
    In [48]:
    np.random.seed(5) # See the note below to see why we set seed=5.
    num_clusters = len(means)
    num_docs, num_words = tf_idf.shape
    
    random_means = []
    random_covs = []
    random_weights = []
    
    for k in range(num_clusters):
        
        # Create a numpy array of length num_words with random normally distributed values.
        # Use the standard univariate normal distribution (mean 0, variance 1).
        # YOUR CODE HERE
        mean = np.random.normal(0, 1, num_words)
        
        # Create a numpy array of length num_words with random values uniformly distributed between 1 and 5.
        # YOUR CODE HERE
        cov = np.random.uniform(1, 5, num_words)
    
        # Initially give each cluster equal weight.
        # YOUR CODE HERE
        weight = 1 / num_clusters
        
        random_means.append(mean)
        random_covs.append(cov)
        random_weights.append(weight)
    
    In [49]:
    len(random_means), len(random_covs), len(random_weights), num_words, num_docs, num_clusters, random_means[0].shape
    
    Out[49]:
    (25, 25, 25, 100282, 5000, 25, (100282,))
    In [50]:
    out_random_init = EM_for_high_dimension(tf_idf, random_means, random_covs, random_weights, cov_smoothing=1e-5)
    
    In [51]:
    out_random_init['loglik']
    
    Out[51]:
    [-764086029.088758,
     2282599968.73394,
     2362197958.6081905,
     2362457265.2184424,
     2362457265.2187605,
     2362457265.2187605]
    In [52]:
    num_clusters
    
    Out[52]:
    25
    In [63]:
    out = EM_for_high_dimension(tf_idf, means, covs, weights, cov_smoothing=1e-5)
    
    In [66]:
    out['loglik']
    
    Out[66]:
    [3855847476.7012835, 2362779935.011491]
    In [67]:
    visualize_EM_clusters(tf_idf, out_random_init['means'], out_random_init['covs'], map_index_to_word)
    
    ==========================================================
    Cluster 0: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.24e-02    3.54e-03
    award       1.53e-02    1.18e-03
    music       1.45e-02    1.29e-03
    university  1.43e-02    6.32e-04
    law         1.27e-02    2.59e-03
    
    ==========================================================
    Cluster 1: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         1.99e-02    2.80e-03
    league      1.91e-02    2.14e-03
    season      1.59e-02    1.17e-03
    football    1.53e-02    2.15e-03
    he          1.36e-02    1.13e-04
    
    ==========================================================
    Cluster 2: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         5.65e-02    6.39e-03
    her         5.21e-02    5.62e-03
    music       1.13e-02    9.50e-04
    de          1.08e-02    1.92e-03
    opera       1.03e-02    3.26e-03
    
    ==========================================================
    Cluster 3: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    film        2.98e-02    5.79e-03
    he          1.38e-02    1.07e-04
    hockey      1.33e-02    5.33e-03
    she         1.31e-02    1.69e-03
    her         1.24e-02    1.20e-03
    
    ==========================================================
    Cluster 4: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         1.96e-02    3.21e-03
    film        1.38e-02    1.72e-03
    he          1.36e-02    1.14e-04
    her         1.35e-02    1.09e-03
    university  1.31e-02    5.36e-04
    
    ==========================================================
    Cluster 5: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.03e-02    2.65e-03
    her         1.37e-02    1.09e-03
    he          1.31e-02    1.21e-04
    law         9.93e-03    1.96e-03
    court       9.54e-03    1.47e-03
    
    ==========================================================
    Cluster 6: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.70e-02    3.62e-03
    her         1.44e-02    1.18e-03
    he          1.18e-02    1.06e-04
    served      1.11e-02    3.79e-04
    state       1.03e-02    5.08e-04
    
    ==========================================================
    Cluster 7: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         3.41e-02    4.33e-03
    her         2.17e-02    1.89e-03
    music       1.73e-02    2.16e-03
    album       1.52e-02    2.38e-03
    marathon    1.25e-02    5.52e-03
    
    ==========================================================
    Cluster 8: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    league      1.87e-02    2.01e-03
    she         1.73e-02    2.67e-03
    he          1.46e-02    1.17e-04
    season      1.38e-02    8.80e-04
    played      1.35e-02    6.32e-04
    
    ==========================================================
    Cluster 9: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         3.38e-02    4.81e-03
    her         1.52e-02    1.23e-03
    team        1.46e-02    8.86e-04
    played      1.45e-02    7.99e-04
    cup         1.39e-02    1.44e-03
    
    ==========================================================
    Cluster 10: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    music       1.54e-02    1.51e-03
    york        1.49e-02    8.13e-04
    he          1.31e-02    1.01e-04
    university  1.23e-02    5.72e-04
    she         1.22e-02    2.06e-03
    
    ==========================================================
    Cluster 11: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.07e-02    2.65e-03
    her         1.61e-02    1.67e-03
    film        1.55e-02    2.19e-03
    music       1.36e-02    1.71e-03
    university  1.36e-02    3.43e-04
    
    ==========================================================
    Cluster 12: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.06e-02    2.98e-03
    tour        1.46e-02    3.64e-03
    he          1.42e-02    1.22e-04
    her         1.26e-02    1.11e-03
    music       1.25e-02    1.40e-03
    
    ==========================================================
    Cluster 13: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.12e-02    2.25e-03
    her         1.65e-02    1.35e-03
    art         1.60e-02    2.76e-03
    nixon       1.50e-02    9.78e-03
    music       1.44e-02    1.41e-03
    
    ==========================================================
    Cluster 14: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    music       1.85e-02    2.31e-03
    film        1.69e-02    2.75e-03
    he          1.31e-02    9.45e-05
    research    1.08e-02    1.06e-03
    university  1.04e-02    2.96e-04
    
    ==========================================================
    Cluster 15: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.14e-02    3.05e-03
    he          1.35e-02    1.11e-04
    league      1.32e-02    1.34e-03
    her         1.30e-02    1.32e-03
    season      1.22e-02    1.01e-03
    
    ==========================================================
    Cluster 16: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.12e-02    2.73e-03
    her         1.65e-02    1.29e-03
    film        1.36e-02    1.26e-03
    he          1.17e-02    1.01e-04
    show        1.14e-02    9.72e-04
    
    ==========================================================
    Cluster 17: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    film        1.66e-02    3.18e-03
    music       1.28e-02    1.23e-03
    he          1.18e-02    9.96e-05
    she         1.11e-02    1.28e-03
    her         1.05e-02    1.17e-03
    
    ==========================================================
    Cluster 18: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         4.03e-02    5.17e-03
    her         2.28e-02    1.75e-03
    band        1.64e-02    2.43e-03
    music       1.24e-02    9.29e-04
    university  1.19e-02    3.84e-04
    
    ==========================================================
    Cluster 19: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    league      1.51e-02    1.26e-03
    season      1.43e-02    1.04e-03
    he          1.33e-02    9.20e-05
    she         1.33e-02    1.71e-03
    club        1.11e-02    8.05e-04
    
    ==========================================================
    Cluster 20: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.63e-02    3.91e-03
    film        1.82e-02    1.93e-03
    her         1.45e-02    1.43e-03
    he          1.26e-02    1.05e-04
    law         1.24e-02    2.35e-03
    
    ==========================================================
    Cluster 21: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         1.86e-02    2.57e-03
    he          1.39e-02    1.30e-04
    league      1.37e-02    1.27e-03
    season      1.22e-02    1.09e-03
    her         1.21e-02    1.09e-03
    
    ==========================================================
    Cluster 22: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         1.50e-02    2.21e-03
    music       1.49e-02    1.63e-03
    he          1.29e-02    1.08e-04
    party       1.12e-02    8.99e-04
    her         1.00e-02    1.08e-03
    
    ==========================================================
    Cluster 23: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.27e-02    2.67e-03
    her         1.79e-02    1.67e-03
    music       1.62e-02    1.32e-03
    band        1.20e-02    1.10e-03
    york        1.18e-02    6.68e-04
    
    ==========================================================
    Cluster 24: Largest mean parameters in cluster 
    
    Word        Mean        Variance    
    she         2.62e-02    3.89e-03
    her         1.77e-02    1.67e-03
    album       1.50e-02    2.47e-03
    he          1.28e-02    1.26e-04
    soccer      1.16e-02    4.19e-03
    
    ==========================================================
    
    In [ ]: